1 Introduction

Colorectal Cancer (CRC) is the second most common cancer in women (614,000 cases per year) and the third most common in men(746,000 cases per year). Currently it is the most common malignant cancer in the gatrointestinal tract, representing 13% of all malignant tumors, and it is considered the main cause of death in gastrointestinal cancer Rebecca L. Siegel MPH (2015). It can be arised from one or a combination of three differen mechanisms, namely chromosmomal inestability (CIN), CpG islands methylator phenotype (CIMP), and microsatellite inestability (MSI). Understanding the specific mechanisms of tumorigenesis and the underlaying genetic and epigenetic traits is crutial in the disease phenotype comprehension. The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here we analyze the expression profiles of those patients, accessible in the form of a raw RNA-seq counts produced by Rahman et al. (2015) using a pipeline based on the R/Bioconductor software package Rsubread.

2 Quality assessment

2.1 Data import

We start importing the raw table of counts.

coadse <- readRDS(file.path("rawCounts", "seCOAD.rds"))
summary(coadse$type)
normal  tumor 
    41    483 
dim(colData(coadse))
[1] 524 549

In this TCGA RNA-seq data we have a total of 524 samples (483 tumor samples and 41 non-tumor samples) that have 549 clinical variables/factors that could be analysed. However, we are not going to explore all these variables, but those that can have more relevance in the data of study.

In order to know more about the data, we explore the colData column, that corresponds to clinical variables, and their corresponding metadata.

colData(coadse)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
                                 type                     bcr_patient_uuid
                             <factor>                             <factor>
TCGA.3L.AA1B.01A.11R.A37K.07    tumor A94E1279-A975-480A-93E9-7B1FF05CBCBF
TCGA.4N.A93T.01A.11R.A37K.07    tumor 92554413-9EBC-4354-8E1B-9682F3A031D9
TCGA.4T.AA8H.01A.11R.A41B.07    tumor A5E14ADD-1552-4606-9FFE-3A03BCF76640
TCGA.5M.AAT4.01A.11R.A41B.07    tumor                                   NA
TCGA.5M.AAT5.01A.21R.A41B.07    tumor                                   NA
                             bcr_patient_barcode form_completion_date
                                        <factor>             <factor>
TCGA.3L.AA1B.01A.11R.A37K.07        TCGA-3L-AA1B            2014-4-22
TCGA.4N.A93T.01A.11R.A37K.07        TCGA-4N-A93T            2014-10-1
TCGA.4T.AA8H.01A.11R.A41B.07        TCGA-4T-AA8H             2014-6-5
TCGA.5M.AAT4.01A.11R.A41B.07                  NA                   NA
TCGA.5M.AAT5.01A.21R.A41B.07                  NA                   NA
                             prospective_collection
                                           <factor>
TCGA.3L.AA1B.01A.11R.A37K.07                    YES
TCGA.4N.A93T.01A.11R.A37K.07                    YES
TCGA.4T.AA8H.01A.11R.A41B.07                     NO
TCGA.5M.AAT4.01A.11R.A41B.07                     NA
TCGA.5M.AAT5.01A.21R.A41B.07                     NA
mcols(colData(coadse), use.names=TRUE)
DataFrame with 549 rows and 2 columns
                                                         labelDescription
                                                              <character>
type                                           sample type (tumor/normal)
bcr_patient_uuid                                         bcr patient uuid
bcr_patient_barcode                                   bcr patient barcode
form_completion_date                                 form completion date
prospective_collection            tissue prospective collection indicator
...                                                                   ...
lymph_nodes_pelvic_pos_total                               total pelv lnp
lymph_nodes_aortic_examined_count                           total aor lnr
lymph_nodes_aortic_pos_by_he                          aln pos light micro
lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
lymph_nodes_aortic_pos_total                                total aor-lnp
                                        CDEID
                                  <character>
type                                       NA
bcr_patient_uuid                           NA
bcr_patient_barcode                   2673794
form_completion_date                       NA
prospective_collection                3088492
...                                       ...
lymph_nodes_pelvic_pos_total          3151828
lymph_nodes_aortic_examined_count     3104460
lymph_nodes_aortic_pos_by_he          3151832
lymph_nodes_aortic_pos_by_ihc         3151831
lymph_nodes_aortic_pos_total          3151827

These metadata consists of two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be use in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.

Now, explore the row (feature) data.

rowData(coadse)
DataFrame with 20115 rows and 3 columns
               symbol     txlen              txgc
          <character> <integer>         <numeric>
1                A1BG      3322 0.564419024683925
2                 A2M      4844 0.488232865400495
9                NAT1      2280 0.394298245614035
10               NAT2      1322 0.389561270801815
12           SERPINA3      3067 0.524942940984676
...               ...       ...               ...
100996331       POTEB      1706 0.430832356389215
101340251    SNORD124       104 0.490384615384615
101340252   SNORD121B        81 0.407407407407407
102724473      GAGE10       538 0.505576208178439
103091865   BRWD1-IT2      1028 0.592412451361868
rowRanges(coadse)
GRanges object with 20115 ranges and 3 metadata columns:
            seqnames            ranges strand |      symbol     txlen
               <Rle>         <IRanges>  <Rle> | <character> <integer>
          1    chr19 58345178-58362751      - |        A1BG      3322
          2    chr12   9067664-9116229      - |         A2M      4844
          9     chr8 18170477-18223689      + |        NAT1      2280
         10     chr8 18391245-18401218      + |        NAT2      1322
         12    chr14 94592058-94624646      + |    SERPINA3      3067
        ...      ...               ...    ... .         ...       ...
  100996331    chr15 20835372-21877298      - |       POTEB      1706
  101340251    chr17 40027542-40027645      - |    SNORD124       104
  101340252     chr9 33934296-33934376      - |   SNORD121B        81
  102724473     chrX 49303669-49319844      + |      GAGE10       538
  103091865    chr21 39313935-39314962      + |   BRWD1-IT2      1028
                         txgc
                    <numeric>
          1 0.564419024683925
          2 0.488232865400495
          9 0.394298245614035
         10 0.389561270801815
         12 0.524942940984676
        ...               ...
  100996331 0.430832356389215
  101340251 0.490384615384615
  101340252 0.407407407407407
  102724473 0.505576208178439
  103091865 0.592412451361868
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome

By using the rowData and rowRanges functions we can see information about the features of interest. In this case each row represents a gene transcript and we can see information about the length, the GC content, the chromosome where it is located, the strand (forward or reverse), etc.

To perform quality assessment and normalization we need first to load the edgeR R/Bioconductor package, create a DGEList object and store it in the results folder.

dge <- DGEList(counts=assays(coadse)$counts, genes=mcols(coadse))
saveRDS(dge, file.path("results", "dge.rds"))

Since gene expression data contain a big amount of variability, it is a good practice to stabilize variability by transforming the expression values into logarithmic scale. In order to do so, we calculate \(\log_2\) CPM values of expression and put them as an additional assay element to ease their manipulation.

assays(coadse)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(coadse)$logCPM[1:5, 1:5]
   TCGA.3L.AA1B.01A.11R.A37K.07 TCGA.4N.A93T.01A.11R.A37K.07
1                     0.2611757                     3.518905
2                    10.0130074                     6.668572
9                    -5.8591298                    -5.859130
10                   -5.8591298                    -5.859130
12                    5.7145152                     5.917077
   TCGA.4T.AA8H.01A.11R.A41B.07 TCGA.5M.AAT4.01A.11R.A41B.07
1                     -0.593583                    -1.566157
2                      6.471984                     7.316527
9                     -5.859130                    -5.859130
10                    -5.859130                    -5.859130
12                     5.505588                     5.475205
   TCGA.5M.AAT5.01A.21R.A41B.07
1                   -0.09524941
2                    7.24810042
9                   -5.85912981
10                  -4.41522112
12                   5.44298099

As we can see, for each sample there are different CPM (Counts Per Million) values of expression. Negative ones correspond to very low expressed samples.

2.2 Subset of samples: Paired Analysis

From the total of 524 samples, before starting to analyse the data we proceed to do a subset of paired samples (tumor and non-tumor samples extracted from the same patient). The reason of doing this is to reduce the within variance, as far as variation between different individuals could be found due to environmental factors. In order to obtain a better estimate of these expression level differences between Tumor and Non-tumor samples, we searched in the TCGA barcode those patients that were present in both Tumor and Non-tumor samples. This meant that those samples belonged to the same patient and, thus, were paired.

df <-  data.frame(rbind(table(coadse$bcr_patient_barcode, coadse$type)))
df_paired<-df[(df$normal > 0) & (df$tumor > 0),]
paired_mask <-coadse$bcr_patient_barcode  %in% rownames(df_paired)
coadse.paired<- coadse[,paired_mask]
dge.paired <- dge[,paired_mask]
table(coadse.paired$type)

normal  tumor 
    38     38 

As is shown in this table, we finally found a total of 38 non-tumor samples and 38 tumor paired samples. However, we consider that with this number of samples should be sufficient powered to detect expression changes between tumor and non-tumor samples.

2.3 Sequencing depth

Before proceeding with any normalization or analysis step, we want to collect an overview of the data that we are working with. Here we examine the sequencing depth by plotting the total number of reads mapped to the genome per sample.

Library sizes in increasing order.

Figure 1: Library sizes in increasing order

In the plot 1 we observe two main aspects that are worth paying attention to. First, we observe that in general the tumor samples present a lower sequencing depth than the normal samples. Even though we could not find any reliable explaination for this behaviour so far in our research, we considered this event dispensable at this time as the sequencing depth will be taken into consideration during the normalization steps. One other important aspect is that we observe that some sample present considerable differences in the sequencing depth and specifically we can distinguish a couple of samples on the left of the graph that present considerably low sequencing depths: this may generate problems in further analysis of the data. For this reason we identify these samples by looking at the at the sample depth as shown below.

sampledepth <- round(dge.paired$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(coadse.paired), 6, 12)
sort(sampledepth)
A6.2679 A6.2682 AA.3531 AA.3517 AA.3534 A6.2683 AA.3518 AA.3520 AA.3522 
    5.9     9.0    13.6    14.7    15.4    15.8    16.5    16.8    17.0 
AA.3516 AA.3525 AA.3514 A6.2680 AA.3527 A6.2685 A6.2671 A6.2678 AA.3496 
   17.2    17.5    17.7    18.1    19.0    19.2    21.1    21.1    25.3 
AA.3697 AA.3660 AZ.6598 AA.3663 A6.2675 AA.3663 AZ.6600 AA.3662 AA.3662 
   25.6    28.4    30.2    30.4    31.0    31.1    31.2    31.5    32.5 
AA.3713 A6.5662 A6.2675 AA.3655 AA.3660 AZ.6599 AA.3712 AZ.6605 AA.3489 
   32.7    33.0    33.4    33.5    33.7    34.2    34.5    34.7    34.7 
AZ.6601 AZ.6603 AA.3525 AA.3655 A6.5667 AA.3534 A6.2686 AA.3511 A6.5667 
   35.1    35.3    36.8    37.2    37.3    37.7    37.8    37.8    38.2 
AZ.6601 A6.2679 AZ.6598 A6.5662 AA.3496 AA.3511 AA.3712 AZ.6605 AA.3518 
   38.6    39.1    39.2    39.5    39.8    40.3    40.4    40.7    40.8 
AZ.6599 AZ.6600 AA.3713 AA.3697 AA.3514 AZ.6603 AA.3489 A6.2682 A6.2678 
   40.9    41.9    42.0    44.4    44.9    45.0    45.2    45.8    46.5 
A6.2671 F4.6704 AA.3516 A6.2685 A6.2683 A6.2686 A6.2680 AA.3531 AA.3522 
   47.3    47.3    47.7    48.6    48.8    49.9    53.3    53.8    55.6 
AA.3517 AA.3527 AA.3520 F4.6704 
   56.0    57.1    57.8    67.5 

Because the samples A6.2679 and A6.2682 present extremely low sequencing depths we decide to remove them from our dataset. Since we decided to proceed with a paired analysis, when removing one sample we remove its paired one too.

# Remove the two samples with very low sequencing depth
mask_remove_low_coverage <- substr(colnames(coadse.paired),9,12) %in% c("2679", "2682")
coadse.paired <- coadse.paired[,!mask_remove_low_coverage]
dge.paired <- dge.paired[,!mask_remove_low_coverage]
summary(coadse.paired$type)
normal  tumor 
    36     36 

We verify that after the filtering of the samples with low library size, our paired dataset contains 36 samples.

# coverage_filtered plot
par(mfrow = c(1, 1), mar = c(4, 5, 1, 1))
ord <- order(dge.paired$sample$lib.size)
barplot(dge.paired$sample$lib.size[ord]/1e+06, las = 1, ylab = "Millions of reads", xlab = "Samples",
        col = c("cyan", "orange")[coadse.paired$type[ord]] )
legend("topleft", c("Normal", "Tumor"), fill = c("cyan", "orange"), inset = 0.01)
Library sizes in increasing order after filtering

Figure 2: Library sizes in increasing order after filtering

Moreover, in figure 2 we can observe again the sorted histogram of library sizes per sample after the filtering.

2.4 Gender among samples

In order to collect an even broader understanding of the dataset we then furher explore the sequencing depth highlighting female and male samples.

ord <- order(dge.paired$sample$lib.size)
barplot(dge.paired$sample$lib.size[ord]/1e+06, las = 1, ylab = "Millions of reads", xlab = "Samples", col = c("cyan", "orange")[coadse.paired$gender[ord]] )
legend("topleft", c("Female", "Male"), fill = c("cyan", "orange"), inset = 0.01)
Library sizes with respect to gender

Figure 3: Library sizes with respect to gender

In figure 3 we can see that not only we have a similar number of female and male samples, but that their sequencing depth is quite well equilibrated as we cannot clearly identify any defined clustering event.

2.5 Distribution of expression levels among samples

Even after the filtering out of the samples with low library sizes values, some samples still present different sequencing depth and also there may be sample-specific biase related to sample preparation. For this reason, we need to carry out a normalization procedure.

assays(coadse.paired)$logCPM <- cpm(dge.paired, log = TRUE, prior.count = 0.25) 

We are then interested in exploring the distribution of the expression levels through all the samples in terms of logarithmic CPM units.

Non-parametric density distribution of expression profiles per sample.

Figure 4: Non-parametric density distribution of expression profiles per sample

In figure 4 we can observe a non-parametric density distribution of expression profiles per sample in terms of logarithmic units. As expected the expression values per sample follow a bimodal distribution with an early peak that corresponds to lowly-expressed genes and a later peak that correponds to highly-expressed ones. In order to provide a clearer visualization, we decided to divide the dataset into two smaller subsets, specifically tumor samples and normal samples.

Non-parametric density distribution of expression profiles per sample.

Figure 5: Non-parametric density distribution of expression profiles per sample

Figure 5 shows the expression levels in terms of logCPM for tumors and control samples separately. We can not appreciate any substancial difference between the distributions of the tumor and normal samples.

2.6 Distribution of expression levels among genes

We are now interested in investigating if there is any gene which has very low expression values so that we can exclude them from our dataset. In fact, if a gene is not expressed, then it can not be differentially expressed and so including them in our analysis would only add noise.

Distribution of average expression level per gene.

Figure 6: Distribution of average expression level per gene

In order to do so, we calculate the average expression of each gene for all the samples and plot their distribution in Figure 6.

2.7 Filtering of lowly-expressed genes

From the distribution of figure 6 we could already visually identify a cutoff to define the too lowly expressed genes. Nontheless, we opted for a more precise approach to calculate the cutoff.

cpmcutoff <- round(10/min(dge.paired$sample$lib.size/1e+06), digits = 1)
sprintf("Cutoff: %s", cpmcutoff)
[1] "Cutoff: 0.7"
nsamplescutoff <- min(table(coadse.paired$gender))
mask <- rowSums(cpm(dge.paired) > cpmcutoff) >= nsamplescutoff
coadse.filt <- coadse.paired[mask, ]
dge.filt <- dge.paired[mask, ]

Once we have identified the cutoff, we filter out the genes that are below it.

par(mar = c(4, 5, 1, 1))
h <- hist(avgexp, xlab = expression("Expression level (" * log[2] * "CPM)"), main = "",
          las = 1, col = "grey", cex.axis = 1.2, cex.lab = 1.5)
x <- cut(rowMeans(assays(coadse.filt)$logCPM), breaks = h$breaks)
lines(h$mids, table(x), type = "h", lwd = 10, lend = 1, col = "darkred")
legend("topright", c("All genes", "Filtered genes"), fill = c("grey", "darkred"))
Distribution of average expression level per gene highlighting the filtering

Figure 7: Distribution of average expression level per gene highlighting the filtering

We can visually observe which genes have been left out from the datatset in Figure 7. Before proceeding with the normalization steps, we prefer to save a copy of the unnormalized data in case we later need to work on it again.

saveRDS(coadse.filt, file.path("results", "coadse.filt.unnorm.rds"))
saveRDS(dge.filt, file.path("results", "dge.filt.unnorm.rds"))

2.8 Normalization

Since different samples may have different RNA compositions and this could be problematic for further analyses we need to take it into account and normalize the data. We estimate a normalization factor for each library using the Trimmed Mean of M-Values and apply it to our data.

dge.filt <- calcNormFactors(dge.filt) 
assays(coadse.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)

Again, we like to keep on with the good practice of storing intermediate generated files.

saveRDS(coadse.filt, file.path("results", "coadse.filt.rds"))
saveRDS(dge.filt, file.path("results", "dge.filt.rds"))

2.9 MA-plots

We now want to visualize the expression profiles of the normalized data. As we did previously, we opted for a subdivision of the dataset in the tumor and normal subsets.
MA-plots of the tumor samples.

Figure 8: MA-plots of the tumor samples

In Figure 8 we observe the MA-plots for the Tumor samples subset. We can see that even after the between and within normalization steps, we still have some artifacts in our data as we can observe for example in 8 B, E, F, J, Z, AI, and AD.

MA-plots of the control samples.

Figure 9: MA-plots of the control samples

In Figure 9 we observe the MA-plots for the Control samples subset and we cannot identify any important expression levels biases in any of the normal samples.

2.10 Batch identification

The next step will be the search for potential surrogate of batch effect indicators. Batch effect include sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological variables in the study. As normalization of the data can not always remove completely the batch effect, we are going to identify this effect, adjust for it and eventually correct it.

Given that each sample names corresponds to a [TCGA barcode] (https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.

tss <- substr(colnames(dge.filt), 6, 7)
table(data.frame(TYPE=coadse.filt$type, TSS=tss))
        TSS
TYPE     A6 AA AZ F4
  normal  9 20  6  1
  tumor   9 20  6  1
samplevial <- substr(colnames(dge.filt), 14, 16)
table(data.frame(TYPE=coadse.filt$type, SAMPLEVIAL=samplevial))
        SAMPLEVIAL
TYPE     01A 11A
  normal   0  36
  tumor   36   0
portionanalyte <- substr(colnames(dge.filt), 18, 20)
table(data.frame(TYPE=coadse.filt$type, PORTIONANALYTE=portionanalyte))
        PORTIONANALYTE
TYPE     01R 02R 11R 21R
  normal  35   1   0   0
  tumor   17   6   7   6
plate <- substr(colnames(dge.filt), 22, 25)
table(data.frame(TYPE=coadse.filt$type, PLATE=plate))
        PLATE
TYPE     0821 0826 1410 1653 1672 1723 1774 1839 A32Z
  normal    0    0    0    1    1    9    4    6   15
  tumor     9    3    3    1    0    9    4    6    1
center <- substr(colnames(dge.filt), 27, 28)
table(data.frame(TYPE=coadse.filt$type, CENTER=center))
        CENTER
TYPE     07
  normal 36
  tumor  36

From this information we can make the following observations:

  • TSS: Samples were collected across 4 different tissue source sites (TSS), with poor representation of one of them (F4).

  • Sample Vial: 36 samples belong to solid tumor samples(01 code) whereas 36 do not (11 code).

  • Portions & Analyte: samples where distributed in different portions and analytes combinations.

  • Plates: Samples were sequenced in 9 different plates.

  • Center: All samples were sequenced at the same center.

  • The analyte code of all samples is a R (RNA samples), as expected.

Since all the sample were analysed in the same center, that is for sure not a cause of batch effect. Moreover, the cross-tabulation tables of sample vial, portion analytes and plates contain too many zero entries which implies that we do not have enough information identify any batch effect with this information. On the other hand, TSS presents values that allow further investigation to detect batch effect.

logCPM <- cpm(dge.paired, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(tss))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(coadse.filt)
outcome <- paste(substr(colnames(coadse.filt), 9, 12), as.character(coadse.filt$type), sep="-")
names(outcome) <- colnames(coadse.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(tss))), fill=sort(unique(batch)))
Hierarchical clustering of the samples.

Figure 10: Hierarchical clustering of the samples

We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the the surrogate of batch indicator. We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 10.

We can observe that samples cluster primarily by sample type, tumor or normal, which is a good sign for our analysis since normal and tumor are our main outcomes of interest.

plotMDS(dge.paired, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(tss))),
       fill=sort(unique(batch)), inset=0.05)
Multidimensional scaling plot of the samples.

Figure 11: Multidimensional scaling plot of the samples

In Figure 11 we show the corresponding MDS plot. Here we see more clearly that the first source of variation separates tumor from normal samples.

From both the hierarchical plot and de MDS plot we can not identify any batch effect for the TSS. For this reason, we proceed the analysis without adjusting or removing any batch effect.

2.11 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] geneplotter_1.60.0          annotate_1.60.1            
 [3] XML_3.98-1.19               AnnotationDbi_1.44.0       
 [5] lattice_0.20-38             edgeR_3.24.3               
 [7] limma_3.38.3                SummarizedExperiment_1.12.0
 [9] DelayedArray_0.8.0          BiocParallel_1.16.6        
[11] matrixStats_0.54.0          Biobase_2.42.0             
[13] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[15] IRanges_2.16.0              S4Vectors_0.20.1           
[17] BiocGenerics_0.28.0         knitr_1.22                 
[19] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1             RColorBrewer_1.1-2     compiler_3.5.3        
 [4] BiocManager_1.30.4     highr_0.8              XVector_0.22.0        
 [7] bitops_1.0-6           tools_3.5.3            zlibbioc_1.28.0       
[10] bit_1.1-14             digest_0.6.18          memoise_1.1.0         
[13] RSQLite_2.1.1          evaluate_0.13          Matrix_1.2-17         
[16] DBI_1.0.0              yaml_2.2.0             xfun_0.6              
[19] GenomeInfoDbData_1.2.0 stringr_1.4.0          bit64_0.9-7           
[22] locfit_1.5-9.1         grid_3.5.3             rmarkdown_1.12        
[25] bookdown_0.9           blob_1.1.1             magrittr_1.5          
[28] codetools_0.2-16       htmltools_0.3.6        xtable_1.8-4          
[31] KernSmooth_2.23-15     stringi_1.4.3          RCurl_1.95-4.12       

3 Differential expression

After the data filtering and normalization, we are interested in identifying gene expression changes and associated p-values between tumor and control samples. For this task, we used the R/Bioconductor packages sva and limma. We are adopting linear regression models, which in general are used in order to represent, in the most accurate way possible, something which is very complex. In the specific case of DE analysis, the model is used as a predictor of gene expression values. For the purpose of fitting the said regression model, it is crucial to build a design matrix. The design matrix is a n x m matrix, where n is the number of samples and m is the number of coeffients which need to be estimated or the types of samples, like cases and controls. It needs to be defined for which factors to adjust. In our case we are working with a paired design and we would have liked to adjust for gender.

We decide to proceed by only adjusting for the patient id, as it is already implicitly adjusting for gender; as we are working with a paired design, gender is identical between every pair of samples, which would make reduntand to adjust for it and would create technical problems during the inversion step of the matrix in the least square algorithm steps.

In the follwing step we create the design matrix, only adjusting for the patient id. We verify as well wether the matrix is full rank and so is invertible with no problem, which is the case.

pid <- substr(colnames(dge.filt), 9, 12)
mod <- model.matrix(~ factor(type) + factor(pid), colData(coadse.filt))
mod0 <- model.matrix(~ factor(pid), colData(coadse.filt))
#Verify wether is full rank --> yes!
is.fullrank(mod)
[1] TRUE

We use the sva package for identifying and removing batch effects and other unwanted sources of variation.

sv <- sva(assays(coadse.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is:  8 
Iteration (out of 5 ):1  2  3  4  5  
sv$n
[1] 8

The SVA algorithm has found 8 surrogate variables which we add to our model.

modsv <- cbind(mod, sv$sv)
mod0sv <- cbind(mod0, sv$sv)

In order to compare gene expression values between samples we need to adjust for the mean-variance realtionship.

barplot(sort(dge.filt$samples$lib.size)/1e+06, xlab = "Samples", ylab = "Library sizes (Millions)", col=rgb(0.7, 0.1, 0.3))
Library sizes among samples

Figure 12: Library sizes among samples

This step could be executed with both limma trend and limma voom. We decided to proceed with the voom() function, since as we can observe in the Figure 12, there are big differences in library sizes among the samples.

So, we calculate the weights that estimate the mean-variance realtionship at gene-by-sample level with the voom() function. The obtained weights will be used to fit the model. Morevoer, at the same time, it performs a between-sample normalization.

v <- voom(dge.filt, modsv, plot=TRUE, normalize.method="quantile")
Mean-variance trend

Figure 13: Mean-variance trend

Next, we fit a linear model with the fit() function.

# Fit linear model for each gene given a series of arrays
fit<- lmFit(v,modsv)

We then calculate the moderated t-statistics through the eBayes() function.

# compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a common value.
fit <- eBayes(fit)
FDRcutoff <- 0.01

Next, we classified the obtained test statistics as up, down or not significant with the help of the decideTest function. The FDR threshold used for this test is corresponds to 1%, as in cancer data is usual to have a lot of changes and so we prefer to be strict.

#Classify a series of related t-statistics as up, down or not significant. 
res <- decideTests(fit, p.value = FDRcutoff) 
summary(res)
       (Intercept) factor(type)tumor factor(pid)2675 factor(pid)2678
Down            97              4224               1               3
NotSig        1760              3684           12965           12963
Up           11110              5059               1               1
       factor(pid)2680 factor(pid)2683 factor(pid)2685 factor(pid)2686
Down                 1               1               3               1
NotSig           12966           12965           12963           12966
Up                   0               1               1               0
       factor(pid)3489 factor(pid)3496 factor(pid)3511 factor(pid)3514
Down                 5               1               0               2
NotSig           12959           12964           12967           12962
Up                   3               2               0               3
       factor(pid)3516 factor(pid)3517 factor(pid)3518 factor(pid)3520
Down                 0               0               0               1
NotSig           12966           12967           12967           12965
Up                   1               0               0               1
       factor(pid)3522 factor(pid)3525 factor(pid)3527 factor(pid)3531
Down                 0               0               1               1
NotSig           12965           12965           12966           12965
Up                   2               2               0               1
       factor(pid)3534 factor(pid)3655 factor(pid)3660 factor(pid)3662
Down                 0               0               3               2
NotSig           12966           12965           12960           12962
Up                   1               2               4               3
       factor(pid)3663 factor(pid)3697 factor(pid)3712 factor(pid)3713
Down                 3               1               0               1
NotSig           12964           12966           12966           12964
Up                   0               0               1               2
       factor(pid)5662 factor(pid)5667 factor(pid)6598 factor(pid)6599
Down                 4               1               0               1
NotSig           12962           12966           12967           12966
Up                   1               0               0               0
       factor(pid)6600 factor(pid)6601 factor(pid)6603 factor(pid)6605
Down                 2               1               5               0
NotSig           12965           12965           12962           12967
Up                   0               1               0               0
       factor(pid)6704                                                
Down                 1  1928  1251  1430   420  1085    95     7     3
NotSig           12966  9327 10863 10207 12144 10780 12742 12959 12964
Up                   0  1712   853  1330   403  1102   130     1     0

We extract the genes chromosome to look at the distribution of the differential expressed genes among the chromosomes, again for the purpose of getting an overview of the data we will be working with.

genesmd <- data.frame(chr = as.character(sub("\\_.*", "", seqnames(rowRanges(coadse.filt)))), symbol = rowData(coadse.filt)[, 1], stringsAsFactors = FALSE)
#Extract a table of the top-ranked genes from a linear model fit.
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
genes<-table(as.character(sub("\\chr*", "",tt$chr[tt$adj.P.Val < FDRcutoff])))
par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
names.arg=c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y","Un")
barplot(genes,  cex.names=0.7, names.arg = names.arg, col=rgb(0.7, 0.1, 0.3))
Distribution of DE genes among chromosomes

Figure 14: Distribution of DE genes among chromosomes

sort(table(tt$chr[tt$adj.P.Val < FDRcutoff]), decreasing = TRUE)

 chr1 chr19  chr2 chr11  chr3 chr17 chr12  chr6  chr7  chr5  chr9  chrX 
  948   676   613   612   534   529   489   430   424   420   405   362 
chr16  chr4 chr10 chr14 chr15  chr8 chr20 chr22 chr13 chr18 chr21  chrY 
  357   347   344   341   276   276   263   221   162   124   114    15 
chrUn 
    1 

Now, we produce two diagnostic plots for limma DE analysis.

par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt$P.Value, xlab = "Raw P-values", main = "A)", las = 1, col=rgb(0.7, 0.1, 0.3))
qqt(fit$t[, 2], df = fit$df.prior + fit$df.residual, main = "B)", pch = ".", cex = 3) 
abline(0, 1, lwd = 2, col= "red")
Diagnostic plots for limma DE analysis

Figure 15: Diagnostic plots for limma DE analysis

In the A plot of figure 15 we visualize the raw p-values distribution: it is mainly uniform a part from the very high peak on the left which corresponds to the significantly DE genes. In the B Q-Q plot of figure 15 we observe that our gene expression values are not following a normal distribution. In fact, if that was the case, we would expect the two lines overlapping and no DE genes.

One interesting aspect we would like to systematically assess is the accuracy of the model with different combinations of adjusting factors and when using different algorithms when adjusting for the mean variance relationship. For doing this, we would need to use a dataset of genes which are known to be differentially expressed between patients with colon cancer and control individuals with no colon cancer. At the moment, we could not find a suitable dataset for this task but in the future, to expand the analysis, it would be interesting to investigate this aspect further. Lastly, we just collect our set of differential expressed genes to proceed with further analyses.

We proceed to filter only those genes with FDR < 1% and a logFC > 2 (a minimum 100% change in expression). We then separately observe the upregulated genes in tumor (logFC>0) and the downregulated ones (logFC<0). We decide to investigate them by sorting them by logFC.

tt.filt <- tt[tt$adj.P.Val < FDRcutoff,]  #filtering by FDR < 0.01
tt.filt <- tt.filt[abs(tt.filt$logFC)>2,] # filtering by logFC > 2
all_DEgenes <- tt.filt
up_regulated <- all_DEgenes[which(all_DEgenes$logFC > 0),] # DE genes up regulated
down_regulated <- all_DEgenes[which(all_DEgenes$logFC < 0),] # DE genes down regulated
length(rownames(all_DEgenes)) # how many DE genes we have
[1] 895
length(rownames(up_regulated)) #  Up-reg DE
[1] 371
length(rownames(down_regulated)) # Down-reg DE
[1] 524
down_regulated <- down_regulated[order(down_regulated$logFC),]
up_regulated <- up_regulated[order(up_regulated$logFC, decreasing = TRUE),]
print(head(down_regulated$symbol, n = 8))
[1] "PPP1R16B" "CA11"     "LVRN"     "CLCN1"    "WNT10A"   "HDAC2"   
[7] "OPRK1"    "SCPEP1"  
print(head(up_regulated$symbol, n = 8))
[1] "NT5DC2"  "COL13A1" "COPA"    "CDH6"    "FZD9"    "LCE2A"   "ESRP2"  
[8] "DPH1"   

After the filtering, the total number of DE genes is 895, from which 371 are up regulated and 524 are down regulated.

tt.filt <- tt[tt$adj.P.Val < FDRcutoff,]  #filtering by FDR < 0.01
all_DEgenes <- tt.filt[abs(tt.filt$logFC)>2,] # filtering by logFC > 2
DEgenes <- rownames(all_DEgenes) 
limma::plotMA(fit, coef = 2, status = rownames(fit$lods) %in% DEgenes, legend = FALSE,
main = "MA plot", hl.pch = 46, hl.cex = 2, bg.pch = 46, bg.cex = 2, las = 1)
legend("bottomright", c("DE genes", "Not DE genes"), fill = c("red", "black"), inset = 0.01)
MA plot of DE genes with logFC > 2 and FDR cutoff < 0.01.

Figure 16: MA plot of DE genes with logFC > 2 and FDR cutoff < 0
01.

saveRDS(all_DEgenes , file.path("results", "alldegenes.rds"))

In figure 16 we can see the distribution of the filtered DE genes. One reassuring thing here is that the dots are centered around the value zero which indicates that there are not expression-level dependent biases.

3.1 Volcano plot

Lastly, we want to visualize the DE genes in a Volcano plot. We set two different thresholds: the log fold change threshold and the adjusted p-value threshold.

EnhancedVolcano(tt,
    lab = tt$symbol,
    x = 'logFC',
    y = 'adj.P.Val',
    xlim = c(-8, 8),
    title = 'Volcano Plot',
    pCutoff = 10e-16,
    FCcutoff = 2,
    transcriptPointSize = 1.0,
    transcriptLabSize = 3.0)
Warning: Removed 1 rows containing missing values (geom_point).
Warning: Removed 1 rows containing missing values (geom_text).
Volcano plot

Figure 17: Volcano plot

3.2 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GSVA_1.30.0                 GSVAdata_1.18.0            
 [3] hgu95a.db_3.2.3             GSEABase_1.44.0            
 [5] org.Hs.eg.db_3.7.0          xtable_1.8-4               
 [7] GOstats_2.48.0              graph_1.60.0               
 [9] Category_2.48.1             Matrix_1.2-17              
[11] EnhancedVolcano_1.0.1       ggrepel_0.8.1              
[13] ggplot2_3.1.1               calibrate_1.7.2            
[15] MASS_7.3-51.4               sva_3.30.1                 
[17] genefilter_1.64.0           mgcv_1.8-28                
[19] nlme_3.1-139                geneplotter_1.60.0         
[21] annotate_1.60.1             XML_3.98-1.19              
[23] AnnotationDbi_1.44.0        lattice_0.20-38            
[25] edgeR_3.24.3                limma_3.38.3               
[27] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
[29] BiocParallel_1.16.6         matrixStats_0.54.0         
[31] Biobase_2.42.0              GenomicRanges_1.34.0       
[33] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[35] S4Vectors_0.20.1            BiocGenerics_0.28.0        
[37] knitr_1.22                  BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            RColorBrewer_1.1-2    
 [4] Rgraphviz_2.26.0       tools_3.5.3            R6_2.4.0              
 [7] DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1      
[10] withr_2.1.2            tidyselect_0.2.5       bit_1.1-14            
[13] compiler_3.5.3         labeling_0.3           bookdown_0.9          
[16] scales_1.0.0           RBGL_1.58.2            stringr_1.4.0         
[19] digest_0.6.18          rmarkdown_1.12         AnnotationForge_1.24.0
[22] XVector_0.22.0         pkgconfig_2.0.2        htmltools_0.3.6       
[25] highr_0.8              rlang_0.3.4            RSQLite_2.1.1         
[28] shiny_1.3.2            dplyr_0.8.1            RCurl_1.95-4.12       
[31] magrittr_1.5           GO.db_3.7.0            GenomeInfoDbData_1.2.0
[34] Rcpp_1.0.1             munsell_0.5.0          stringi_1.4.3         
[37] yaml_2.2.0             zlibbioc_1.28.0        plyr_1.8.4            
[40] grid_3.5.3             blob_1.1.1             promises_1.0.1        
[43] crayon_1.3.4           splines_3.5.3          locfit_1.5-9.1        
[46] pillar_1.3.1           codetools_0.2-16       glue_1.3.1            
[49] evaluate_0.13          BiocManager_1.30.4     httpuv_1.5.1          
[52] gtable_0.3.0           purrr_0.3.2            assertthat_0.2.1      
[55] xfun_0.6               mime_0.6               later_0.8.0           
[58] survival_2.44-1.1      tibble_2.1.1           shinythemes_1.1.2     
[61] memoise_1.1.0         

4 GSEA

In many differential gene expression analyses, the magnitude of gene expression changes may be small and very few significant DE genes will be idenified after having adjusted for multiple testing. For this reason, we want to see if there are small but consistent changes occuring for a number of genes operating in a common pathway. We go through this workflow by assessing DE genes direclty at the geneset level, starting from the expression data themselves. We approach this analysis by using the GSEA algorithm.

4.1 Preparation

To start with the GSEA, we start putting our expression data and the collection of gene sets from the GSVA[http://www.bioconductor.org/packages/release/data/experiment/html/GSVAdata.html] dataset c2BroadSets in a GeneSetCollection object, as it contains curated gene sets, suggesting a good set representation for biological data analysis.

# collect gene sets
data(c2BroadSets)
gsc <- GeneSetCollection(c2BroadSets)
gsc
GeneSetCollection
  names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
  unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)
length(gsc) # number of gene sets in the collection
[1] 3272
head(names(gsc)) # some of the gene sets in the collection
[1] "NAKAMURA_CANCER_MICROENVIRONMENT_UP" 
[2] "NAKAMURA_CANCER_MICROENVIRONMENT_DN" 
[3] "WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP"
[4] "WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN"
[5] "WINTER_HYPOXIA_UP"                   
[6] "WINTER_HYPOXIA_DN"                   

In order to reduce the amount of data to analyse, we are going to restrict the analysis to the pathways KEGG, REACTOME and BIOCARTA.

gsc <- gsc[c(grep("^KEGG", names(gsc)),
grep("^REACTOME", names(gsc)), grep("^BIOCARTA", names(gsc)))]
gsc
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
  unique identifiers: 55902, 2645, ..., 8544 (6744 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)
length(gsc)
[1] 833

Now we can start our GSEA analysis, using the algorithm refered as simple GSEA (Irizarry et al. (2009)[https://journals.sagepub.com/doi/abs/10.1177/0962280209351908]).

First, we need to map the identifiers from the gene sets to the identifiers of the data we are going to analyze. Furthermore, we create an incidence matrix (Im) which indicates which genes belong to what gene set.

gsc <- mapIdentifiers(gsc, AnnoOrEntrezIdentifier(metadata(coadse.filt)$annotation))
gsc
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
  unique identifiers: 55902, 2645, ..., 8544 (6744 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)
Im <- incidence(gsc)
dim(Im)
[1]  833 6744
Im[1:2, 1:10]
                                55902 2645 5232 5230 5162 5160 5161 55276
KEGG_GLYCOLYSIS_GLUCONEOGENESIS     1    1    1    1    1    1    1     1
KEGG_CITRATE_CYCLE_TCA_CYCLE        0    0    0    0    1    1    1     0
                                7167 84532
KEGG_GLYCOLYSIS_GLUCONEOGENESIS    1     1
KEGG_CITRATE_CYCLE_TCA_CYCLE       0     0

We can see that several genes are present in more than one gene set.

Once done it, the following step will be to discard those genes that are not present in our data, and also discard all genes in our data that are not annotated in any gene sets.

Im <- Im[, colnames(Im) %in% rownames(all_DEgenes)]
dim(Im)
[1] 833 312
coadse.filt <- coadse.filt[colnames(Im), ]
dim(coadse.filt)
[1] 312  72
dge.filt <- dge.filt[colnames(Im),]
dim(dge.filt)
[1] 312  72

As a result we have discarted almost 6400 genes that didn’t belong to our data.

4.2 Simple GSEA

In order to determine whether an a priori defined set of genes shows statistically significant differences between tumor vs normal phenotypes, we are going to perform a Gene Set Enrichment Analysis to introduce a method for pathway analysis assessing DE directly at gene set level.

The main scope of the GSEA approach is, as mentioned, to detect consistent and robust changes in expression within a gene set. To do so, we calculate the Enrichment Score (ES) as a z-score.

We decided to filter out the genesets that contain less than 10 genes. In fact, very small gene sets may induce little reliability and increase type I errors (false positives). Then we also calculate the z-score.

Im <- Im[rowSums(Im) >= 5, ]
tGSgenes <- all_DEgenes[match(colnames(Im), rownames(all_DEgenes)), "t"] # calculate all t-statistics for genes in each gene set
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im)) # calculating Zscore statistic
length(zS)
[1] 98
head(zS, n=10)
                  KEGG_GLYCOLYSIS_GLUCONEOGENESIS 
                                       -14.054563 
                       KEGG_FATTY_ACID_METABOLISM 
                                       -22.468084 
                KEGG_STEROID_HORMONE_BIOSYNTHESIS 
                                        -3.128532 
                           KEGG_PURINE_METABOLISM 
                                       -18.179222 
                       KEGG_PYRIMIDINE_METABOLISM 
                                       -39.806624 
             KEGG_ARGININE_AND_PROLINE_METABOLISM 
                                         8.772841 
                          KEGG_RETINOL_METABOLISM 
                                         3.420898 
        KEGG_PORPHYRIN_AND_CHLOROPHYLL_METABOLISM 
                                        17.655980 
KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450 
                                       -16.873131 
             KEGG_DRUG_METABOLISM_CYTOCHROME_P450 
                                         3.158353 

The incidence matrix after the filtering has lost 200 gene sets approximately. As we can see, know we have 618 pathways that contain a total of 4174 genes.

qqnorm(zS)
abline(0,1)
Q-Q Plot of gene set Z-scores

Figure 18: Q-Q Plot of gene set Z-scores

As we can observe in the Q-Q plot of figure 18 we can detect that the gene sets Z-scores do not follow a normal distribution (expected but the null hypothesis of no DE gene sets), which indicates a promising number of DE gene sets.

Just to have an overview we rank the gene sets according to Z-scores, and show in decreasing order.

rnkGS_Z <- sort(abs(zS), decreasing = TRUE)
head(rnkGS_Z, n=20)
                   REACTOME_SIGNALLING_BY_NGF 
                                     43.99818 
 KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION 
                                     42.76569 
                   KEGG_PYRIMIDINE_METABOLISM 
                                     39.80662 
                  KEGG_OLFACTORY_TRANSDUCTION 
                                     37.54749 
REACTOME_P75_NTR_RECEPTOR_MEDIATED_SIGNALLING 
                                     33.75918 
    REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS 
                                     33.30035 
                     REACTOME_G1_S_TRANSITION 
                                     33.27730 
               KEGG_CALCIUM_SIGNALING_PATHWAY 
                                     33.20118 
                   REACTOME_DIABETES_PATHWAYS 
                                     33.04180 
                   KEGG_WNT_SIGNALING_PATHWAY 
                                     32.83759 
 REACTOME_DOWNSTREAM_EVENTS_IN_GPCR_SIGNALING 
                                     32.35510 
                 KEGG_NOTCH_SIGNALING_PATHWAY 
                                     32.23747 
                          KEGG_TIGHT_JUNCTION 
                                     32.13952 
          REACTOME_CELL_CELL_ADHESION_SYSTEMS 
                                     32.06619 
          REACTOME_FORMATION_OF_PLATELET_PLUG 
                                     31.95290 
           KEGG_DRUG_METABOLISM_OTHER_ENZYMES 
                                     31.52333 
                  KEGG_LONG_TERM_POTENTIATION 
                                     31.51740 
                          REACTOME_HEMOSTASIS 
                                     31.50406 
          REACTOME_SIGNALING_IN_IMMUNE_SYSTEM 
                                     30.68054 
         REACTOME_OLFACTORY_SIGNALING_PATHWAY 
                                     30.49469 

Let’s define a function called plotGS to produce a scatter plot, for a given gene set, of the mean expression values comparing Tumor vs Normal phenotypes.

plotGS <- function(se, gs, pheno,dge, ...) {
l <- levels(colData(se)[, pheno])
idxSamples1 <- colData(se)[, pheno] == l[1]
idxSamples2 <- colData(se)[, pheno] == l[2]
exps1 <- rowMeans(assays(se)$logCPM[gs, idxSamples1])
exps2 <- rowMeans(assays(se)$logCPM[gs, idxSamples2])
rng <- range(c(exps1, exps2))
plot(exps1, exps2, pch = 21, col = "black", bg = "black", xlim = rng, ylim = rng,
xlab = l[1], ylab = l[2], ...)
abline(a = 0, b = 1, lwd = 2, col = "red")
nn<-dge[rownames(dge$genes) %in% gs,]$genes$symbol
text(exps1, exps2, labels = nn, cex = 0.8, pos=2)
}

We perform one sample z-test and compute the number of DE gene sets according to the adjusted p-value with and FDR of 1%.

pv <- pmin(pnorm(zS), 1 - pnorm(zS))
sum(pv < 0.05)
[1] 97
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.01)]
length(DEgs)
[1] 94
head(DEgs, n = 30)
 [1] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS"                  
 [2] "KEGG_FATTY_ACID_METABOLISM"                       
 [3] "KEGG_STEROID_HORMONE_BIOSYNTHESIS"                
 [4] "KEGG_PURINE_METABOLISM"                           
 [5] "KEGG_PYRIMIDINE_METABOLISM"                       
 [6] "KEGG_ARGININE_AND_PROLINE_METABOLISM"             
 [7] "KEGG_RETINOL_METABOLISM"                          
 [8] "KEGG_PORPHYRIN_AND_CHLOROPHYLL_METABOLISM"        
 [9] "KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450"
[10] "KEGG_DRUG_METABOLISM_CYTOCHROME_P450"             
[11] "KEGG_DRUG_METABOLISM_OTHER_ENZYMES"               
[12] "KEGG_ABC_TRANSPORTERS"                            
[13] "KEGG_MAPK_SIGNALING_PATHWAY"                      
[14] "KEGG_CALCIUM_SIGNALING_PATHWAY"                   
[15] "KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION"      
[16] "KEGG_CHEMOKINE_SIGNALING_PATHWAY"                 
[17] "KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION"     
[18] "KEGG_CELL_CYCLE"                                  
[19] "KEGG_OOCYTE_MEIOSIS"                              
[20] "KEGG_UBIQUITIN_MEDIATED_PROTEOLYSIS"              
[21] "KEGG_LYSOSOME"                                    
[22] "KEGG_ENDOCYTOSIS"                                 
[23] "KEGG_APOPTOSIS"                                   
[24] "KEGG_WNT_SIGNALING_PATHWAY"                       
[25] "KEGG_NOTCH_SIGNALING_PATHWAY"                     
[26] "KEGG_HEDGEHOG_SIGNALING_PATHWAY"                  
[27] "KEGG_TGF_BETA_SIGNALING_PATHWAY"                  
[28] "KEGG_AXON_GUIDANCE"                               
[29] "KEGG_FOCAL_ADHESION"                              
[30] "KEGG_CELL_ADHESION_MOLECULES_CAMS"                

After filtering by adjusted p value we obtained a total of 94 gene sets.

Now, we plot the mean expression values per gene for 6 gene sets of interest selected by the Z-score.

top1G <- colnames(Im)[which(Im[names(rnkGS_Z)[1], ] == 1)]
genesGS1 <- rowData(coadse.filt)[top1G,1] # to know the gene symbol
genesGS1
 [1] "ADCY5"     "PDPK1"     "RASGRF2"   "GSK3A"     "ADCYAP1R1" "HDAC2"    
 [7] "PSEN1"     "RAPGEF1"   "ITGB3BP"   "MEF2A"     "RTN4"      "PCSK6"    
top2G <- colnames(Im)[which(Im[names(rnkGS_Z)[2], ] == 1)]
genesGS2 <- rowData(coadse.filt)[top2G,1]
genesGS2
 [1] "CHRM1"     "GRIN2D"    "HTR4"      "PTAFR"     "PTGER4"    "AGTR2"    
 [7] "GALR1"     "ADCYAP1R1" "OPRK1"     "GABRA4"    "GPR50"     "TAAR8"    
[13] "GRM4"     
top10G <- colnames(Im)[which(Im[names(rnkGS_Z)[10], ] == 1)]
genesGS10 <- rowData(coadse.filt)[top10G,1]
genesGS10
 [1] "PPP3R2"  "PPP3CC"  "CHP1"    "CREBBP"  "EP300"   "BTRC"    "PPP2R5E"
 [8] "FZD6"    "FZD7"    "FZD9"    "AXIN2"   "WNT16"   "WNT10A"  "PSEN1"  
top15G <- colnames(Im)[which(Im[names(rnkGS_Z)[15], ] == 1)]
genesGS15 <- rowData(coadse.filt)[top15G,1]
genesGS15
[1] "PDPK1"   "RASGRP2" "PDGFA"   "PDGFB"   "PSAP"    "DAGLA"   "HRG"    
top37G <- colnames(Im)[which(Im[names(rnkGS_Z)[37], ] == 1)]
genesGS37 <- rowData(coadse.filt)[top37G,1]
genesGS37
 [1] "PSMC6"   "PSMB7"   "PSMD5"   "CLASP1"  "NUP85"   "CENPA"   "ITGB3BP"
 [8] "KIF20A"  "NSL1"    "CLIP1"  
par(mfrow = c(3, 2), mar= c(2,5,2,2), mai=c(0.5,0.6,0.6,1))
plotGS(coadse.filt, top1G, "type", main = paste("1)",names(rnkGS_Z)[1]),dge.filt,  cex.main = 2, cex.lab = 2, las = 1)
plotGS(coadse.filt, top2G, "type", main = paste("2)",names(rnkGS_Z)[2]),dge.filt, cex.main = 2, cex.lab = 2, las = 1)
plotGS(coadse.filt, top10G, "type", main = paste("3)", names(rnkGS_Z)[10]), dge.filt, cex.main = 2, cex.lab = 2, las = 1)
plotGS(coadse.filt, top15G, "type", main = paste("4)",names(rnkGS_Z)[15]),dge.filt, cex.main = 2, cex.lab = 2, las = 1)
plotGS(coadse.filt, top37G, "type", main = paste("5)",names(rnkGS_Z)[37]),dge.filt, cex.main = 2, cex.lab = 2, las = 1)
Scatter Plots of mean expression values for genesets of interests

Figure 19: Scatter Plots of mean expression values for genesets of interests

We can observe that in the selected genesets the distribution of the genes is mainly downregulated in tumor cases respect to normal. In general, gene downregulation in cancer could possibly be the cause of dysregulation of important pathways, repression of apoptosis or tumor suppressors’ repression.

Now we want to investigate the specific genes that are up- and down-regulated.

In the plot 1 of figure 19, we can observe an overexpression of the MEF2A transcription factor myocyte-enhancer factor 2 known to play a role in adaptive responses during development and adult life. Even if the specific role in tumorgenesis of the MEF2 family has not been clarified yet, it has been identified that it can favor matrix degradative processes when its activation is promoted by TGF-Beta by decreasing the stability of HDACs Di Giorgio, Hancock, and Brancolini (2018). Consistently with this scenario we observe an underexpression of the HDAC2 gene, which competes for binding to the same region of MEF2.

In the plot 2 of figure 19 we observe an overexpression of the HT receptor (HTR4) which is involved in the neuronal response with the serotonergic synapse pathway Arese et al. (2018). HT receptors have been shown to be over expressed in cancer tissues and that their antagonists inhibit the HT effect to different extents and induce apoptosis Radin and Patel (2017).

In plot 3 of figure 19 we can identify an overexpression of two Frizzled Receptors(FZD7 and FZD9). FZD9 expression was reported to be upregulated in different carcinomas Qiu et al. (2016). Moreover, a dysregulation of the WNT pathway by FZD7 was related to tumorigenesis and metastasis. As mentioned, a possible disregulation of the WNT pathway is induced by the Frizzled Receptors; consistently to this we observe that expression levels of WNT10a and WNT6 are affected as well in tumor patients. A desregulation in Wnt pathway has been associated with the accumulation of the oncogenic protein beta-catein, and therefore, with the development of cancer Fearon (2011).

In plot 4 of figure 19 we identify an overexpression of the PDGF-B gene.In a mice study was identified that PDGF-B released from colon tumor cells regulated tumor growth by inducing blood vessel formation. Also they found that an elevated expression of PDGF-B was also correlated with tumor size Hsu, Huang, and Friedman (1995).

In plot 5 of figure 19 we identify an overexpression of the CENPA gene. Recent work has demonstrated that the kinetochore protein CENP-A was overexpressed in all of 11 primary human colorectal cancer tissues. It is also known that chromosome missegregation during mitosis is the main cause of aneuploidy and contributes to oncogenesis. Centromere protein (CENP)-A is the centromere-specific histone-H3-like variant essential for centromere structure, function and the assembly of the kinetochore Tomonaga et al. (2003).

boxplotgenes <- function(se, gene) {
  iterations = dim(se)[2]
  variables = 2
  output <- matrix(ncol=variables, nrow=iterations)
  output <- data.frame(output)
  colnames(output) <- c("type", "logCPM")
  aa <- se[rowData(se)$symbol == gene]$type
  bb<-assays(se[rowData(se)$symbol == gene])$logCPM
  for(i in 1:iterations){
  output$type[i] <- aa[i]
  output$logCPM[i] <- bb[i]
  }
  output$type<-gsub(x = output$type, pattern = "1", replacement = "normal")
  output$type<-gsub(x = output$type, pattern = "2", replacement = "tumor")
  boxplot(logCPM ~ type, data=output, col=c("grey",rgb(0.7, 0.1, 0.3)), main=gene, ylab="logCPM")
}
par(mfrow = c(3, 3), mar= c(2,5,2,2), mai=c(0.5,0.6,0.6,1))
boxplotgenes(coadse.filt, "MEF2A")
boxplotgenes(coadse.filt, "HDAC2")
boxplotgenes(coadse.filt, "HTR4")
boxplotgenes(coadse.filt, "FZD7")
boxplotgenes(coadse.filt, "FZD9")
boxplotgenes(coadse.filt, "PDGFB")
boxplotgenes(coadse.filt, "CENPA")
Boxplot of logCPM expression of genes of interest between tumor and normal samples

Figure 20: Boxplot of logCPM expression of genes of interest between tumor and normal samples

In Figure 20 you can observe the boxplots for the logCPM expression values of the genes we just discussed between tumor and normal samples.

4.3 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ggplot2_3.1.1               calibrate_1.7.2            
 [3] MASS_7.3-51.4               GSVA_1.30.0                
 [5] GSVAdata_1.18.0             hgu95a.db_3.2.3            
 [7] GSEABase_1.44.0             org.Hs.eg.db_3.7.0         
 [9] xtable_1.8-4                GOstats_2.48.0             
[11] graph_1.60.0                Category_2.48.1            
[13] Matrix_1.2-17               geneplotter_1.60.0         
[15] annotate_1.60.1             XML_3.98-1.19              
[17] AnnotationDbi_1.44.0        lattice_0.20-38            
[19] edgeR_3.24.3                limma_3.38.3               
[21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
[23] BiocParallel_1.16.6         matrixStats_0.54.0         
[25] Biobase_2.42.0              GenomicRanges_1.34.0       
[27] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[29] S4Vectors_0.20.1            BiocGenerics_0.28.0        
[31] knitr_1.22                  BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            RColorBrewer_1.1-2    
 [4] Rgraphviz_2.26.0       tools_3.5.3            R6_2.4.0              
 [7] DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1      
[10] withr_2.1.2            tidyselect_0.2.5       bit_1.1-14            
[13] compiler_3.5.3         bookdown_0.9           scales_1.0.0          
[16] genefilter_1.64.0      RBGL_1.58.2            stringr_1.4.0         
[19] digest_0.6.18          rmarkdown_1.12         AnnotationForge_1.24.0
[22] XVector_0.22.0         pkgconfig_2.0.2        htmltools_0.3.6       
[25] highr_0.8              rlang_0.3.4            RSQLite_2.1.1         
[28] shiny_1.3.2            dplyr_0.8.1            RCurl_1.95-4.12       
[31] magrittr_1.5           GO.db_3.7.0            GenomeInfoDbData_1.2.0
[34] Rcpp_1.0.1             munsell_0.5.0          stringi_1.4.3         
[37] yaml_2.2.0             zlibbioc_1.28.0        plyr_1.8.4            
[40] grid_3.5.3             blob_1.1.1             promises_1.0.1        
[43] crayon_1.3.4           splines_3.5.3          locfit_1.5-9.1        
[46] pillar_1.3.1           codetools_0.2-16       glue_1.3.1            
[49] evaluate_0.13          BiocManager_1.30.4     httpuv_1.5.1          
[52] gtable_0.3.0           purrr_0.3.2            assertthat_0.2.1      
[55] xfun_0.6               mime_0.6               later_0.8.0           
[58] survival_2.44-1.1      tibble_2.1.1           shinythemes_1.1.2     
[61] memoise_1.1.0         

5 Colorectal cancer Stages’ analysis

5.1 Quality assessment

The Quality assessment for tumor data will be performed following the steps and the reasoning used in the Quality assesment for the paired data. For this reason, here we are only going to highlight the strategies that differ from the previous pipeline.

5.1.1 Subset of samples: Tumor data

From the total of samples before starting to analyse the data, we proceed to do a subset that only includes the tumor data. The reason of doing this is because we are interested also in the differential expression analysis between the different stages of the Colon Adenocarcinoma Cancer.

In order to create the subset, we looked for those samples that had “tumor” as their type and discarded those with “normal” type. We also discarded those samples that had no information about the stage of the tumor.

Although the data is categorized in diverse subgroups, we will work with the following ones: stage I, stage II, stage III and stage IV, being each one the sum of all the subcategories that share the same stage number:

filter_tumor_mask <- rownames(colData(coadse)[colData(coadse)$type == "tumor",])
coadse.tumor <- coadse[, filter_tumor_mask]
coadse.tumor <- coadse.tumor[, !is.na(coadse.tumor$ajcc_pathologic_tumor_stage)]
coadse.tumor <- coadse.tumor[, coadse.tumor$ajcc_pathologic_tumor_stage != "[Not Available]"]
coadse.tumor$ajcc_pathologic_tumor_stage <- gsub(x = coadse.tumor$ajcc_pathologic_tumor_stage, pattern = "A", replacement = "")
coadse.tumor$ajcc_pathologic_tumor_stage <- gsub(x = coadse.tumor$ajcc_pathologic_tumor_stage, pattern = "B", replacement = "")
coadse.tumor$ajcc_pathologic_tumor_stage <- gsub(x = coadse.tumor$ajcc_pathologic_tumor_stage, pattern = "C", replacement = "")
table(coadse.tumor$ajcc_pathologic_tumor_stage)

  Stage I  Stage II Stage III  Stage IV 
       72       168       124        63 

5.1.2 Sequencing depth among tumor samples

First of all, we want to collect an overview of the data that we are working with. Starting examining the sequencing depth by plotting the total number of reads mapped to the genome per sample.

CM.4748 AA.3696 AA.3837 AA.A02H AA.A010 AA.A00E A6.2679 AA.3672 AA.3524 
    1.2     3.0     4.5     5.1     5.5     5.7     5.9     6.6     6.8 
AA.A01K AA.A004 AA.3947 AA.A01C CA.6717 D5.6536 AA.3976 AA.3506 AA.3841 
    7.3     7.4     7.6     8.0     8.0     8.1     8.3     8.6     8.6 
A6.2682 AA.A03F AA.3667 AA.3664 AA.A029 AA.3519 AA.3812 AA.3949 AA.3956 
    9.0     9.6     9.9    10.2    10.2    11.0    11.0    11.2    11.3 
AA.3842 AA.3848 AA.3875 AA.3930 AA.3955 AA.A01Q A6.3807 AA.3862 AA.3679 
   11.5    11.5    12.1    12.1    12.1    12.2    12.3    12.3    12.5 
A6.A5ZU AA.3684 AA.3542 AA.3850 AA.A00F A6.A56B AA.3560 AA.3680 AA.3715 
   12.6    12.7    12.8    12.8    13.0    13.1    13.2    13.2    13.2 
AA.A01F AA.A01T CM.4747 AA.3681 AZ.4313 AA.3495 AA.3531 AA.3521 AA.A01Z 
   13.4    13.4    13.4    13.5    13.5    13.6    13.6    13.7    13.7 
AA.3861 AA.3502 AA.3970 AA.A00N AA.3950 AA.A024 AA.3666 AA.3530 AA.3710 
   13.9    14.1    14.1    14.2    14.3    14.3    14.5    14.6    14.6 
AA.3517 AA.3821 AA.3972 AA.A00D AA.3971 AA.A00L A6.3808 AA.3544 AA.3548 
   14.7    14.7    14.7    14.7    14.9    14.9    15.0    15.0    15.0 
AA.3845 AA.3975 AA.3852 AA.A01G AZ.4315 AA.3552 AA.3939 AA.3982 AA.3815 
   15.0    15.0    15.1    15.1    15.1    15.2    15.2    15.2    15.3 
AA.A00R A6.A567 AA.3534 AA.3941 AA.3968 AA.3846 AA.3867 AA.3973 AA.A00K 
   15.3    15.4    15.4    15.4    15.4    15.5    15.5    15.6    15.7 
CM.6169 A6.2683 AA.A00J AA.A00U AA.A02E DM.A1DB AA.3811 AA.A00W AA.A03J 
   15.7    15.8    15.8    15.9    15.9    15.9    16.0    16.0    16.0 
AA.3833 AA.A017 AA.A00Q CM.4752 AA.A02F AA.3518 AA.3844 AA.3529 AA.3980 
   16.2    16.2    16.3    16.3    16.4    16.5    16.5    16.7    16.7 
AA.3520 AY.4070 AA.3522 AA.3554 AA.3860 AA.3543 AA.A00O AZ.4308 A6.2676 
   16.8    16.8    17.0    17.0    17.0    17.1    17.1    17.1    17.2 
AA.3516 AA.3989 AA.3858 AA.A00Z AY.4071 AA.3488 AA.3561 AA.3877 AA.A01V 
   17.2    17.2    17.3    17.3    17.3    17.4    17.4    17.4    17.4 
AA.3525 AA.3492 AA.3851 AA.A00A AA.3514 AA.3549 AA.3866 AA.3984 AA.3986 
   17.5    17.6    17.6    17.6    17.7    17.7    17.7    17.7    17.7 
AA.3994 AA.A022 AA.3870 AZ.4684 AA.3864 AA.3979 AA.A01X A6.2680 AA.3538 
   17.7    17.7    17.9    17.9    18.0    18.0    18.0    18.1    18.2 
AA.3556 AA.3695 AA.3814 AA.3509 AA.A01S CM.4746 AA.3869 AA.A01P AA.A02W 
   18.2    18.2    18.3    18.4    18.5    18.5    18.7    18.7    18.7 
AA.A02R CA.5256 G4.6304 AA.3510 AA.3527 AA.3952 AA.A01D AZ.4681 CM.5341 
   18.8    18.8    18.8    19.0    19.0    19.0    19.0    19.0    19.0 
AA.3553 A6.2685 AA.3562 AA.3855 CM.4750 AA.A01R AZ.4614 AA.3831 AA.A02J 
   19.1    19.2    19.2    19.2    19.2    19.4    19.6    19.7    19.7 
AA.A02O AA.3819 CK.4951 A6.4107 AA.3494 AA.3532 AA.3966 AA.3688 NH.A6GC 
   19.8    19.9    19.9    20.0    20.0    20.1    20.6    20.7    20.7 
AA.A01I AZ.4615 A6.2671 A6.2678 AA.3818 A6.2681 AA.3678 AA.3555 AA.3854 
   20.9    21.0    21.1    21.1    21.2    21.4    21.4    21.5    21.6 
4T.AA8H G4.6310 AA.3692 D5.6920 AA.3856 AA.3872 AA.3673 AA.3693 F4.6856 
   21.7    21.8    21.9    21.9    22.4    22.7    23.0    23.3    23.8 
G4.6306 F4.6806 D5.6924 CM.6677 CM.5861 AA.3697 CM.5864 D5.6533 CA.5796 
   23.8    24.3    24.5    25.1    25.4    25.6    25.8    26.3    26.6 
CM.5862 F4.6855 NH.A6GB G4.6323 A6.6141 CM.6166 F4.6808 G4.6320 AY.6196 
   26.8    26.9    27.0    27.4    28.0    28.0    28.2    28.3    29.0 
NH.A8F8 AZ.5407 A6.A566 D5.6929 DM.A288 CM.5860 D5.6927 CM.5349 G4.6309 
   29.0    29.1    29.2    29.3    29.4    29.8    29.8    29.9    29.9 
D5.5540 AZ.6598 CM.4743 F4.6805 CK.5912 G4.6298 D5.6930 G4.6315 A6.2675 
   30.0    30.2    30.2    30.3    30.4    30.6    30.8    30.8    31.0 
AA.3663 A6.6142 CM.6165 G4.6311 G4.6295 F4.6460 F4.6461 CM.6674 CK.5916 
   31.1    31.3    31.5    31.5    31.6    31.8    31.8    31.9    32.3 
G4.6321 AA.3662 DM.A280 G4.6293 CM.6167 G4.6588 AU.3779 CA.5255 QL.A97D 
   32.4    32.5    32.5    32.5    32.6    32.6    32.7    32.9    33.0 
A6.4105 AD.5900 AZ.4323 AZ.6607 D5.6530 F4.6463 AA.3655 D5.5537 F4.6807 
   33.2    33.2    33.2    33.2    33.2    33.2    33.5    33.5    33.5 
AA.3660 A6.5666 G4.6302 D5.6535 AZ.6599 RU.A8FL CK.5915 CK.6751 CM.6172 
   33.7    33.8    33.9    34.0    34.2    34.2    34.3    34.5    34.5 
4N.A93T AY.A8YK G4.6627 3L.AA1B AM.5821 AZ.6605 CM.5344 D5.6531 DM.A28K 
   34.6    34.6    34.6    34.7    34.7    34.7    34.9    34.9    34.9 
D5.5538 AD.6889 G4.6294 G4.6625 A6.6652 A6.6653 D5.6539 G4.6299 AD.A5EJ 
   35.3    35.4    35.4    35.5    35.6    35.6    35.6    35.7    36.3 
D5.6538 NH.A50V AZ.5403 CA.6719 CM.6171 DM.A28G G4.6626 A6.6137 D5.6540 
   36.3    36.3    36.4    36.4    36.5    36.6    36.6    36.7    36.8 
G4.6297 D5.6931 A6.6140 A6.6138 AZ.4616 DM.A28E CM.6675 CM.6680 A6.2686 
   36.9    37.0    37.1    37.3    37.4    37.5    37.6    37.7    37.8 
CM.6161 CM.4751 CM.6676 CK.6746 D5.5539 NH.A50T A6.5667 AM.5820 CM.5863 
   37.9    38.0    38.0    38.1    38.1    38.1    38.2    38.2    38.3 
G4.6586 A6.5660 AY.6386 D5.6926 DM.A282 AY.A54L AZ.6601 A6.5664 CK.4947 
   38.3    38.4    38.4    38.5    38.5    38.6    38.6    38.8    38.8 
AA.3526 CA.6716 A6.6654 A6.6782 DM.A0XD F4.6459 A6.5662 D5.6537 G4.6314 
   39.0    39.0    39.2    39.3    39.4    39.4    39.5    39.5    39.5 
D5.6932 DM.A1D4 G4.6628 DM.A28A AA.3496 AU.6004 D5.6532 AA.3685 CM.6163 
   39.6    39.6    39.6    39.7    39.8    39.8    39.8    39.9    39.9 
F4.6570 CM.6170 D5.6928 DM.A28C CM.5348 NH.A5IV AA.3511 AA.3712 F4.6809 
   39.9    40.0    40.0    40.1    40.2    40.2    40.3    40.4    40.4 
AA.A02K AZ.6606 AZ.6608 CK.4952 QG.A5YV QG.A5YX AY.6197 D5.6529 CM.6168 
   40.6    40.6    40.6    40.6    40.6    40.7    40.8    40.9    41.0 
DM.A28H CM.6162 CK.5914 AZ.6600 AA.3713 A6.6651 CA.5797 CM.6679 CA.6718 
   41.0    41.1    41.6    41.9    42.0    42.3    42.5    42.5    42.7 
CM.6678 AD.6548 AA.A02Y AD.A5EK AA.3675 CK.5913 D5.7000 DM.A1HA CA.6715 
   42.7    42.8    43.2    43.2    43.3    43.3    43.3    43.7    44.1 
D5.6922 QG.A5YW AY.A71X F4.6569 AZ.4682 AA.3489 D5.6898 CM.6164 DM.A1HB 
   44.2    44.3    44.4    44.5    44.6    45.2    45.2    45.3    45.9 
NH.A6GA AY.5543 CM.5868 G4.6322 D5.5541 CK.6748 QG.A5Z2 DM.A285 AD.6888 
   46.1    46.5    46.6    46.6    46.7    46.8    46.8    47.0    47.2 
F4.6854 G4.6303 DM.A28F AD.6899 CA.5254 CK.6747 DM.A1D7 DM.A28M A6.A565 
   47.3    47.9    48.0    48.2    48.2    48.3    48.4    48.5    48.6 
QG.A5Z1 D5.6541 AY.A69D DM.A1D9 WS.AB45 DM.A0XF CM.4744 NH.A50U DM.A1DA 
   48.9    49.0    49.1    49.4    49.4    50.6    50.7    50.7    50.9 
CK.4948 D5.6534 D5.6923 DM.A1D0 CK.4950 A6.6649 F4.6703 DM.A0X9 A6.5657 
   51.5    51.7    52.0    52.0    52.5    52.7    53.9    54.5    56.7 
G4.6307 DM.A1D6 A6.6648 F4.6704 
   56.9    62.7    63.1    67.5 
Library sizes in increasing order.

Figure 21: Library sizes in increasing order

Looking at the plot in figure 21, we can see two important details that we must take into consideration. The first one is that all four stages have different samples with different sequencing depth. The other one is that we observe that within the considerable differences in the sequencing depth of some samples, we can distinguish some samples on the left part of the graph that present considerably low sequencing depths: this may generate problems in further analysis and that is why we then wanted to remove these samples as shown below:

# Remove those samples with very low sequencing depth (<10)
coadse.tumor <- coadse.tumor[, dge.tumor$samples$lib.size >= 10*1e06]
dge.tumor <- DGEList(counts = assays(coadse.tumor)$counts, group = coadse.tumor$ajcc_pathologic_tumor_stage, genes = genes.tumor)
# coverage_filtered plot
par(mfrow = c(1, 1), mar = c(4, 5, 1, 1))
ord.tumor <- order(dge.tumor$sample$lib.size)
ggplot(as.data.frame(dge.tumor$samples), aes(x = row.names(dge.tumor$samples), y = dge.tumor$samples$lib.size[ord.tumor]/1000000, fill = dge.tumor$samples$group[ord.tumor])) + geom_bar(stat = "identity") + theme_bw() + theme(legend.title = element_blank(), axis.text.x=element_blank()) + ylab("Milions of reads") + xlab("Samples")
Library sizes in increasing order after the filtering

Figure 22: Library sizes in increasing order after the filtering

In this second plot(Figure 22) we can see the subset of data that we are going to use after having removed the samples with less coverage depth. In total,we have lost 21 samples from our dataset.

5.1.3 Gender among samples in tumor data

ord2 <- order(dge.tumor$sample$lib.size)
table(coadse.tumor$gender)

FEMALE   MALE 
   194    212 
barplot(dge.tumor$sample$lib.size[ord2]/1e+06, las = 1, ylab = "Millions of reads", xlab = "Samples",
        col = c("cyan", "orange")[coadse.tumor$gender[ord2]], border = NA)
legend("topleft", c("Female", "Male"), fill = c("cyan", "orange"), inset = 0.01)
Library sizes with respect to gender

Figure 23: Library sizes with respect to gender

In the figure 23 we can see that, as happened with the paired analysis, not only we have a similar number of female and male samples, but their sequencing depth is quite well equilibrated as we cannot clearly identify any defined clustering event.

5.1.4 Distribution of expression levels among samples in tumor data

Then, we first run a within sample normalization and we also explore the distribution of the expression levels through all the samples in terms of logarithmic CPM units.

Non-parametric density distribution of expression profiles per sample.

Figure 24: Non-parametric density distribution of expression profiles per sample

In figure 24 we can observe a non-parametric density distribution of expression profiles per sample which follows a bimodal distribution.

Non-parametric density distribution of expression profiles per sample.

Figure 25: Non-parametric density distribution of expression profiles per sample

To provide a clearer visualization, we decided to divide the dataset into four smaller subsets, one for each stage. Figure 25 shows the expression levels in terms of logCPM for the different stages of tumor samples separately.

As we do not observe extreme differences in the distribution of the expression values in the different tumor stages, we decide to not exclude any sample from the dataset in this step.

5.1.5 Distribution of expression levels among genes in tumor data

We now want to look if there is any gene which has very low expression to exclude them from the dataset.

Distribution of average expression level per gene.

Figure 26: Distribution of average expression level per gene

In order to do so, we have calculated the average expression of each gene for all the samples and plotted their distribution in Figure 26.

5.1.6 Filtering of lowly-expressed genes in tumor data

Those genes that have very low counts should be removed prior to further analysis because a gene must be expressed at some minimal level before it is likely to be translated into a protein and also because genes statistically they are very unlikely to be assessed as significantly differential expressed. So, such genes can therefore be removed from the analysis without any loss of information.

Then, to filter the lowly-expressed genes we calculate first a minimum CPM cutoff value of expression; 10 reads per million in the sample with lowest depth. We will keep also only those genes that have this minimum level of expression in at least as many samples as the smallest group of comparison.

cpmcutoff.tumor <- round(10/min(dge.tumor$sample$lib.size/1e+06), digits = 1)
cpmcutoff.tumor 
[1] 1
sprintf("Cutoff: %s", cpmcutoff.tumor)
[1] "Cutoff: 1"
nsamplescutoff.tumor <- min(table(coadse.tumor$ajcc_pathologic_tumor_stage))
nsamplescutoff.tumor
[1] 60
mask.tumor <- rowSums(cpm(dge.tumor) > cpmcutoff.tumor) >= nsamplescutoff.tumor
coadse.tumor.filt <- coadse.tumor[mask.tumor, ]
dge.tumor.filt <- dge.tumor[mask.tumor, ]

After the filtering of lowly-expressed genes, we have lost almost 6.500 genes.

dim(coadse.tumor)
[1] 20115   406
dim(coadse.tumor.filt)
[1] 13569   406

We can visually observe which genes have been left out from the datatset in Figure 7

par(mar = c(4, 5, 1, 1))
h <- hist(avgexp.tumor, xlab = expression("Expression level (" * log[2] * "CPM)"), main = "",
          las = 1, col = "grey", cex.axis = 1.2, cex.lab = 1.5)
x <- cut(rowMeans(assays(coadse.tumor.filt)$logCPM), breaks = h$breaks)
lines(h$mids, table(x), type = "h", lwd = 10, lend = 1, col = "darkred")
legend("topright", c("All genes", "Filtered genes"), fill = c("grey", "darkred"))
Distribution of average expression level per gene after the filtering

Figure 27: Distribution of average expression level per gene after the filtering

We can visually observe which genes have been left out from the datatset in Figure 27.

5.1.7 Normalization tumor data

To make gene expression values comparable across samples, normalization step is needed to continue with our analysis. We estimate a normalization factor for each library using the Trimmed Mean of M-Values and we apply it to our data.

dge.tumor.filt <- calcNormFactors(dge.tumor.filt) 
assays(coadse.tumor.filt)$logCPM <- cpm(dge.tumor.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)
saveRDS(coadse.tumor.filt, file.path("results", "coadse.tumor.filt.rds"))
saveRDS(dge.tumor.filt, file.path("results", "dge.tumor.filt.rds"))

5.1.8 MA plots tumor data

We now want to visualize the expression profiles of the normalized data for each tumor stage.

For Stage I –>
MA-plots of the tumor samples in Stage I.

Figure 28: MA-plots of the tumor samples in Stage I

In Figure 28 we observe the MA-plots for the Tumor samples of stage I. We can see that even after the between and within normalization steps, we still have some artifacts in our data as we can observe for example in 28 A, C, F, AH and AT.

For Stage II –>
MA-plots of the tumor samples in Stage II.

Figure 29: MA-plots of the tumor samples in Stage II

In Figure 29 we observe the MA-plots for the Tumor samples of stage II. We can see that the majority of MA plots tend to follow the blue line with some exceptions like Figure 29 B, E, F, N, AO, DM, DR, EI and EX.

For Stage III –>
MA-plots of the tumor samples in Stage III.

Figure 30: MA-plots of the tumor samples in Stage III

In Figure 30 we observe the MA-plots for the Tumor samples of stage III. With the samples of stage III we see a similar situation compared to the previous MA-plots; we can see that the red line of the majority of plots tends to follow the blue line with some exceptions like Figure 30 J, K, AV, BB, BC, CT and DN.

For Stage IV –>
MA-plots of the tumor samples in Stage IV.

Figure 31: MA-plots of the tumor samples in Stage IV

Finally, in Figure 31 we observe the MA-plots for the Tumor samples of stage IV. In this case, we observe again a similar situation; there are some plots were we can detect some artifacts, like for example J, M, V, AD, AI and AZ.

5.1.9 Batch Identification for tumor data.

As normalization of the data can not always remove completely the batch effect, the next step will be indeed the search of potential surrogate of batch effect indicators derived from different elements of the TCGA barcode of each sample.

tss.tumor <- substr(colnames(dge.tumor.filt), 6, 7)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, TSS=tss.tumor))
           TSS
STAGE       3L 4N 4T A6 AA AD AM AU AY AZ CA CK CM D5 DM F4 G4 NH QG QL RU
  Stage I    1  0  0  2 31  3  0  1  3  3  0  4  9  5  0  4  2  0  1  1  0
  Stage II   0  0  1 12 63  2  1  1  2  5  8  4 12 13 15  6  9  2  1  0  0
  Stage III  0  1  0 15 37  2  0  0  3  4  1  4 10 11  8  5 10  3  3  0  1
  Stage IV   0  0  0  7 24  0  1  0  2  8  0  2  5  1  1  1  5  3  0  0  0
           TSS
STAGE       WS
  Stage I    0
  Stage II   1
  Stage III  0
  Stage IV   0
center.tumor <- substr(colnames(dge.tumor.filt), 27, 28)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, CENTER=center.tumor))
           CENTER
STAGE        07
  Stage I    70
  Stage II  158
  Stage III 118
  Stage IV   60
samplevial.tumor <- substr(colnames(dge.tumor.filt), 14, 16)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, SAMPLEVIAL=samplevial.tumor))
           SAMPLEVIAL
STAGE       01A 01B
  Stage I    70   0
  Stage II  158   0
  Stage III 117   1
  Stage IV   58   2
portionanalyte.tumor <- substr(colnames(dge.tumor.filt), 18, 20)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, PORTIONANALYTE=portionanalyte.tumor))
           PORTIONANALYTE
STAGE       01R 02R 03R 11R 12R 21R 23R 31R 33R 42R 43R 72R
  Stage I    28   6   0  31   2   1   1   0   0   0   1   0
  Stage II   68  12   0  59   4  14   0   0   0   1   0   0
  Stage III  33  14   0  51   0  18   0   2   0   0   0   0
  Stage IV   31   2   1  18   1   3   0   2   1   0   0   1
plate.tumor <- substr(colnames(dge.tumor.filt), 22, 25)
table(data.frame(STAGE=coadse.tumor.filt$ajcc_pathologic_tumor_stage, PLATE=plate.tumor))
           PLATE
STAGE       0821 0826 0905 1022 1113 1410 1653 1723 1774 1839 1873 1928 A002
  Stage I      5    1    3   10    1    4   10    5    4    6    1    2    3
  Stage II    10    3   12   15    1   11   13   14   13   16    0   12    6
  Stage III    6    2    5    9    2    5    9   22    6   12    2    5    5
  Stage IV     1    0    7    9    0    6    6    5    6    8    0    1    0
           PLATE
STAGE       A00A A083 A089 A155 A16W A180 A28H A32Y A32Z A37K A41B
  Stage I      1    1    1    0    1    0    3    1    4    2    1
  Stage II     5    1    0    8    6    0    3    3    2    2    2
  Stage III    1    4    1    3    3    1    8    2    1    4    0
  Stage IV     2    0    1    0    1    0    1    1    1    1    3

Too look for potential surrogates of batch effect indicators we need the maximum information explained by the possible batch generator, that means no zeros along the different groups. In order to be able to identify a possible batch effect and looking the previous tables, we decided to continue our analysis focusing in ‘plate’ information. Then we decided to see graphically how batch might be confounding the outcome of interest by doing a hierarchical clustering.

logCPM <- cpm(dge.tumor.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(plate.tumor))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(coadse.tumor.filt)
outcome <- paste(substr(colnames(coadse.tumor.filt), 9, 12), as.character(coadse.tumor.filt$ajcc_pathologic_tumor_stage), sep="-")
names(outcome) <- colnames(coadse.tumor.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(plate.tumor))), cex=0.75,  fill=sort(unique(batch)))
Hierarchical clustering of the samples.

Figure 10: Hierarchical clustering of the samples

As a result of this graph in figure 10, we concluded that the hierarchical clustering can become quite useless as a graphical tool when you have a huge quantity of samples, so then we decided to change our approach and try to identify graphically the batch effect with a MDS plot.

col.stage <- c("blue","cyan", "orange", "purple")[factor(coadse.tumor.filt$ajcc_pathologic_tumor_stage)]
batch <- as.integer(factor(plate.tumor))
plotMDS(dge.tumor.filt, dim=c(3,4), col=col.stage, pch= batch, cex=2)
legend("topleft", legend=levels(factor(coadse.tumor.filt$ajcc_pathologic_tumor_stage)), col=col.stage ,pch=16, cex= 0.75)
legend("topright", legend=levels(factor(plate.tumor)),pch= batch, cex=0.75, ncol = 2)
 MDS plot

Figure 32: MDS plot

With the MDSplot in figure 32 we are still not able to recognize any cluster of any group, so we can not conclude that ‘plate’ information could be a potential surrogate of batch effect.

5.1.10 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] geneplotter_1.60.0          annotate_1.60.1            
 [3] XML_3.98-1.19               AnnotationDbi_1.44.0       
 [5] lattice_0.20-38             ggplot2_3.1.1              
 [7] edgeR_3.24.3                limma_3.38.3               
 [9] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
[11] BiocParallel_1.16.6         matrixStats_0.54.0         
[13] Biobase_2.42.0              GenomicRanges_1.34.0       
[15] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[17] S4Vectors_0.20.1            BiocGenerics_0.28.0        
[19] knitr_1.22                  BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1             locfit_1.5-9.1         assertthat_0.2.1      
 [4] digest_0.6.18          R6_2.4.0               plyr_1.8.4            
 [7] RSQLite_2.1.1          evaluate_0.13          highr_0.8             
[10] pillar_1.3.1           zlibbioc_1.28.0        rlang_0.3.4           
[13] lazyeval_0.2.2         blob_1.1.1             Matrix_1.2-17         
[16] rmarkdown_1.12         labeling_0.3           stringr_1.4.0         
[19] RCurl_1.95-4.12        bit_1.1-14             munsell_0.5.0         
[22] compiler_3.5.3         xfun_0.6               pkgconfig_2.0.2       
[25] htmltools_0.3.6        tidyselect_0.2.5       tibble_2.1.1          
[28] GenomeInfoDbData_1.2.0 bookdown_0.9           codetools_0.2-16      
[31] crayon_1.3.4           dplyr_0.8.1            withr_2.1.2           
[34] bitops_1.0-6           grid_3.5.3             xtable_1.8-4          
[37] gtable_0.3.0           DBI_1.0.0              magrittr_1.5          
[40] scales_1.0.0           KernSmooth_2.23-15     stringi_1.4.3         
[43] XVector_0.22.0         RColorBrewer_1.1-2     tools_3.5.3           
[46] bit64_0.9-7            glue_1.3.1             purrr_0.3.2           
[49] yaml_2.2.0             colorspace_1.4-1       BiocManager_1.30.4    
[52] memoise_1.1.0         

5.2 Differential expression

After having filtered and normalized our tumor dataset, we looked for differential expression of genes among the cancer stages. We applied a limma voom pipeline, the steps of which are explained in depth already in the paired DEanalysis section and for this reason here we only focus on the steps that differ in the new one. Since the tumor stage has 4 different values (“StageI”,“StageII”,“StageIII”, “StageIV”“), we decide to use a model with no intercept and a contrast matrix instead, following the instruction of the”Several groups" section of the limma user guide. After having obtained the model with the help of the contrast matrix, we applied eBayes. After multiple test adjusting, we are left with no DE genes.

coadse.tumor$ajcc_pathologic_tumor_stage <- gsub(x = coadse.tumor$ajcc_pathologic_tumor_stage, pattern = " ", replacement = "")
coadse.tumor$ajcc_pathologic_tumor_stage <- droplevels(as.factor(coadse.tumor$ajcc_pathologic_tumor_stage))
coadse.tumor$age_at_initial_pathologic_diagnosis <- droplevels(coadse.tumor$age_at_initial_pathologic_diagnosis)
lev <- c("StageI","StageII","StageIII", "StageIV")
f <- factor(coadse.tumor$ajcc_pathologic_tumor_stage, levels=lev)
design <- model.matrix(~0+f , colData(coadse.tumor))
colnames(design) <- lev
fit1 <- lmFit(assays(coadse.tumor)$logCPM,design)
contrast.matrix <- makeContrasts(StageI-StageII, StageI-StageIII, StageI-StageIV, StageII-StageIII, StageII-StageIV, StageIII-StageIV, levels=design)
fit2 <- contrasts.fit(fit1, contrast.matrix)
fit2 <- eBayes(fit2)
tt<-topTable(fit2, coef=1, n=Inf,adjust="fdr")
DEgenes <- rownames(tt)[tt$adj.P.Val < 0.1]
length(DEgenes)
[1] 0
saveRDS(tt, file.path("results", "tt.rds"))

This can be due to the fact that in many different biological contexts gene expression changes may be very little and after Multiple test adjusting no significant DE genes can be identified. We then decide to run a Principal Component Analysis to further explore the gene expression among the four cancer stages. Since we use the prcomp() function, we first need to transpose our matrix in order to have the samples as rows, if not we would get a pca analysis explaining how genes are related to each other, which in this case is not the desired output. After having prepared our data, we run the pca analyis and we can observe that the first four principal components explain respectively 11.7%, 7%, 6% and 4% of the variability.

a<-assays(coadse.tumor)
ta<-t(assays(coadse.tumor)$logCPM)
pca<- prcomp(ta)
summary(pca)
Importance of components:
                           PC1      PC2      PC3      PC4      PC5     PC6
Standard deviation     44.6935 36.51157 32.42333 28.17916 25.07691 20.6460
Proportion of Variance  0.1171  0.07818  0.06165  0.04657  0.03688  0.0250
Cumulative Proportion   0.1171  0.19531  0.25696  0.30353  0.34040  0.3654
                            PC7      PC8      PC9     PC10     PC11     PC12
Standard deviation     18.59050 17.71089 17.08951 16.53713 16.14581 14.85982
Proportion of Variance  0.02027  0.01839  0.01713  0.01604  0.01529  0.01295
Cumulative Proportion   0.38567  0.40406  0.42119  0.43723  0.45251  0.46546
                           PC13     PC14     PC15     PC16     PC17     PC18
Standard deviation     14.48627 13.96364 13.04648 12.84439 12.54125 12.17461
Proportion of Variance  0.01231  0.01143  0.00998  0.00967  0.00922  0.00869
Cumulative Proportion   0.47777  0.48920  0.49918  0.50886  0.51808  0.52677
                           PC19     PC20     PC21     PC22    PC23    PC24
Standard deviation     11.98712 11.63894 11.46548 10.98483 10.8444 10.6052
Proportion of Variance  0.00843  0.00794  0.00771  0.00708  0.0069  0.0066
Cumulative Proportion   0.53520  0.54314  0.55085  0.55793  0.5648  0.5714
                           PC25     PC26     PC27    PC28    PC29    PC30
Standard deviation     10.16535 10.06506 10.04708 9.83001 9.43738 9.30215
Proportion of Variance  0.00606  0.00594  0.00592 0.00567 0.00522 0.00507
Cumulative Proportion   0.57748  0.58342  0.58934 0.59501 0.60023 0.60530
                          PC31    PC32   PC33    PC34    PC35    PC36
Standard deviation     9.03161 8.98159 8.8583 8.70738 8.38520 8.26667
Proportion of Variance 0.00478 0.00473 0.0046 0.00445 0.00412 0.00401
Cumulative Proportion  0.61009 0.61482 0.6194 0.62387 0.62799 0.63200
                          PC37   PC38    PC39    PC40    PC41    PC42
Standard deviation     8.22396 8.1527 8.01166 7.95894 7.90294 7.71894
Proportion of Variance 0.00397 0.0039 0.00376 0.00371 0.00366 0.00349
Cumulative Proportion  0.63596 0.6399 0.64363 0.64734 0.65100 0.65450
                          PC43    PC44    PC45    PC46    PC47    PC48
Standard deviation     7.62120 7.48398 7.43333 7.37608 7.31295 7.25205
Proportion of Variance 0.00341 0.00328 0.00324 0.00319 0.00314 0.00308
Cumulative Proportion  0.65790 0.66119 0.66443 0.66762 0.67075 0.67384
                          PC49    PC50    PC51    PC52    PC53    PC54
Standard deviation     7.19878 7.13457 7.05670 6.99305 6.92780 6.85429
Proportion of Variance 0.00304 0.00298 0.00292 0.00287 0.00281 0.00276
Cumulative Proportion  0.67688 0.67986 0.68278 0.68565 0.68846 0.69122
                          PC55   PC56    PC57    PC58    PC59    PC60   PC61
Standard deviation     6.82387 6.7797 6.74418 6.66551 6.61843 6.58317 6.5330
Proportion of Variance 0.00273 0.0027 0.00267 0.00261 0.00257 0.00254 0.0025
Cumulative Proportion  0.69395 0.6966 0.69931 0.70192 0.70449 0.70703 0.7095
                          PC62    PC63    PC64    PC65    PC66   PC67
Standard deviation     6.46915 6.42760 6.42152 6.35342 6.29894 6.2584
Proportion of Variance 0.00245 0.00242 0.00242 0.00237 0.00233 0.0023
Cumulative Proportion  0.71199 0.71441 0.71683 0.71919 0.72152 0.7238
                          PC68    PC69   PC70    PC71    PC72    PC73
Standard deviation     6.21788 6.14443 6.1309 6.05797 6.02530 6.00753
Proportion of Variance 0.00227 0.00221 0.0022 0.00215 0.00213 0.00212
Cumulative Proportion  0.72608 0.72830 0.7305 0.73265 0.73478 0.73690
                          PC74    PC75    PC76    PC77    PC78    PC79
Standard deviation     5.95645 5.93414 5.91833 5.88620 5.81640 5.79733
Proportion of Variance 0.00208 0.00207 0.00205 0.00203 0.00198 0.00197
Cumulative Proportion  0.73898 0.74105 0.74310 0.74513 0.74712 0.74909
                          PC80    PC81    PC82   PC83    PC84    PC85
Standard deviation     5.77299 5.74475 5.72211 5.6907 5.64574 5.62126
Proportion of Variance 0.00195 0.00194 0.00192 0.0019 0.00187 0.00185
Cumulative Proportion  0.75104 0.75298 0.75490 0.7568 0.75866 0.76052
                          PC86    PC87    PC88    PC89    PC90    PC91
Standard deviation     5.60174 5.57095 5.54936 5.47291 5.46457 5.44289
Proportion of Variance 0.00184 0.00182 0.00181 0.00176 0.00175 0.00174
Cumulative Proportion  0.76236 0.76418 0.76598 0.76774 0.76949 0.77123
                          PC92   PC93    PC94    PC95    PC96    PC97
Standard deviation     5.43243 5.3871 5.36085 5.29807 5.29398 5.28114
Proportion of Variance 0.00173 0.0017 0.00169 0.00165 0.00164 0.00164
Cumulative Proportion  0.77296 0.7747 0.77635 0.77799 0.77964 0.78127
                          PC98    PC99   PC100   PC101   PC102   PC103
Standard deviation     5.23626 5.23255 5.20496 5.19151 5.18347 5.16841
Proportion of Variance 0.00161 0.00161 0.00159 0.00158 0.00158 0.00157
Cumulative Proportion  0.78288 0.78448 0.78607 0.78765 0.78923 0.79080
                         PC104   PC105   PC106   PC107   PC108   PC109
Standard deviation     5.15305 5.12863 5.12108 5.07225 5.06783 5.04423
Proportion of Variance 0.00156 0.00154 0.00154 0.00151 0.00151 0.00149
Cumulative Proportion  0.79235 0.79390 0.79543 0.79694 0.79845 0.79994
                         PC110   PC111   PC112   PC113   PC114   PC115
Standard deviation     5.01331 4.99007 4.98494 4.94455 4.92491 4.90752
Proportion of Variance 0.00147 0.00146 0.00146 0.00143 0.00142 0.00141
Cumulative Proportion  0.80141 0.80287 0.80433 0.80577 0.80719 0.80860
                        PC116  PC117   PC118   PC119   PC120   PC121   PC122
Standard deviation     4.8891 4.8850 4.86350 4.82485 4.82178 4.81539 4.78333
Proportion of Variance 0.0014 0.0014 0.00139 0.00137 0.00136 0.00136 0.00134
Cumulative Proportion  0.8100 0.8114 0.81279 0.81415 0.81552 0.81688 0.81822
                         PC123   PC124   PC125   PC126   PC127   PC128
Standard deviation     4.78080 4.76804 4.74201 4.74120 4.68750 4.68386
Proportion of Variance 0.00134 0.00133 0.00132 0.00132 0.00129 0.00129
Cumulative Proportion  0.81956 0.82089 0.82221 0.82353 0.82482 0.82610
                         PC129   PC130   PC131   PC132   PC133   PC134
Standard deviation     4.67484 4.66644 4.65354 4.60859 4.60377 4.59877
Proportion of Variance 0.00128 0.00128 0.00127 0.00125 0.00124 0.00124
Cumulative Proportion  0.82739 0.82866 0.82993 0.83118 0.83242 0.83366
                         PC135   PC136   PC137  PC138  PC139   PC140   PC141
Standard deviation     4.57334 4.56747 4.55318 4.5223 4.5179 4.49251 4.47413
Proportion of Variance 0.00123 0.00122 0.00122 0.0012 0.0012 0.00118 0.00117
Cumulative Proportion  0.83489 0.83611 0.83733 0.8385 0.8397 0.84091 0.84208
                         PC142   PC143   PC144   PC145   PC146   PC147
Standard deviation     4.45544 4.44353 4.43244 4.42651 4.41182 4.39015
Proportion of Variance 0.00116 0.00116 0.00115 0.00115 0.00114 0.00113
Cumulative Proportion  0.84324 0.84440 0.84555 0.84670 0.84784 0.84897
                         PC148   PC149   PC150   PC151  PC152  PC153   PC154
Standard deviation     4.37981 4.37028 4.35685 4.34626 4.3358 4.3228 4.30286
Proportion of Variance 0.00112 0.00112 0.00111 0.00111 0.0011 0.0011 0.00109
Cumulative Proportion  0.85010 0.85122 0.85233 0.85344 0.8545 0.8556 0.85672
                         PC155   PC156   PC157   PC158   PC159   PC160
Standard deviation     4.27687 4.27060 4.25577 4.25117 4.23543 4.22520
Proportion of Variance 0.00107 0.00107 0.00106 0.00106 0.00105 0.00105
Cumulative Proportion  0.85780 0.85887 0.85993 0.86099 0.86204 0.86309
                         PC161   PC162   PC163   PC164   PC165   PC166
Standard deviation     4.21232 4.18941 4.17341 4.16563 4.15160 4.14360
Proportion of Variance 0.00104 0.00103 0.00102 0.00102 0.00101 0.00101
Cumulative Proportion  0.86413 0.86516 0.86618 0.86720 0.86821 0.86921
                        PC167   PC168   PC169   PC170   PC171   PC172
Standard deviation     4.1327 4.10695 4.09876 4.07790 4.07162 4.06287
Proportion of Variance 0.0010 0.00099 0.00099 0.00098 0.00097 0.00097
Cumulative Proportion  0.8702 0.87120 0.87219 0.87316 0.87414 0.87510
                         PC173   PC174   PC175   PC176   PC177   PC178
Standard deviation     4.05339 4.04455 4.01090 4.00330 3.99674 3.98506
Proportion of Variance 0.00096 0.00096 0.00094 0.00094 0.00094 0.00093
Cumulative Proportion  0.87607 0.87703 0.87797 0.87891 0.87985 0.88078
                         PC179   PC180   PC181   PC182   PC183  PC184  PC185
Standard deviation     3.95933 3.95520 3.94865 3.93825 3.93104 3.9245 3.9157
Proportion of Variance 0.00092 0.00092 0.00091 0.00091 0.00091 0.0009 0.0009
Cumulative Proportion  0.88170 0.88262 0.88353 0.88444 0.88535 0.8862 0.8871
                         PC186   PC187   PC188   PC189   PC190   PC191
Standard deviation     3.89855 3.88968 3.88385 3.87187 3.86025 3.84428
Proportion of Variance 0.00089 0.00089 0.00088 0.00088 0.00087 0.00087
Cumulative Proportion  0.88804 0.88893 0.88981 0.89069 0.89156 0.89243
                         PC192   PC193   PC194   PC195   PC196   PC197
Standard deviation     3.83863 3.82117 3.81845 3.79988 3.79161 3.77460
Proportion of Variance 0.00086 0.00086 0.00086 0.00085 0.00084 0.00084
Cumulative Proportion  0.89329 0.89415 0.89501 0.89585 0.89670 0.89753
                         PC198   PC199   PC200   PC201   PC202   PC203
Standard deviation     3.76647 3.75734 3.74509 3.73608 3.73207 3.71334
Proportion of Variance 0.00083 0.00083 0.00082 0.00082 0.00082 0.00081
Cumulative Proportion  0.89836 0.89919 0.90001 0.90083 0.90165 0.90246
                         PC204  PC205  PC206   PC207   PC208   PC209   PC210
Standard deviation     3.70693 3.6999 3.6907 3.67838 3.67069 3.64806 3.63786
Proportion of Variance 0.00081 0.0008 0.0008 0.00079 0.00079 0.00078 0.00078
Cumulative Proportion  0.90326 0.9041 0.9049 0.90566 0.90645 0.90723 0.90800
                         PC211   PC212   PC213   PC214   PC215   PC216
Standard deviation     3.63668 3.62390 3.61442 3.60922 3.60190 3.58148
Proportion of Variance 0.00078 0.00077 0.00077 0.00076 0.00076 0.00075
Cumulative Proportion  0.90878 0.90955 0.91032 0.91108 0.91184 0.91259
                         PC217   PC218   PC219   PC220   PC221   PC222
Standard deviation     3.56900 3.56600 3.55709 3.55361 3.52974 3.51467
Proportion of Variance 0.00075 0.00075 0.00074 0.00074 0.00073 0.00072
Cumulative Proportion  0.91334 0.91409 0.91483 0.91557 0.91630 0.91702
                         PC223   PC224   PC225   PC226   PC227  PC228  PC229
Standard deviation     3.50518 3.49158 3.48292 3.47675 3.47090 3.4623 3.4546
Proportion of Variance 0.00072 0.00071 0.00071 0.00071 0.00071 0.0007 0.0007
Cumulative Proportion  0.91774 0.91846 0.91917 0.91988 0.92059 0.9213 0.9220
                         PC230   PC231   PC232   PC233   PC234   PC235
Standard deviation     3.42926 3.41371 3.40741 3.40331 3.39621 3.38637
Proportion of Variance 0.00069 0.00068 0.00068 0.00068 0.00068 0.00067
Cumulative Proportion  0.92268 0.92336 0.92404 0.92472 0.92540 0.92607
                         PC236   PC237   PC238   PC239   PC240   PC241
Standard deviation     3.38031 3.37654 3.37250 3.36566 3.34443 3.33955
Proportion of Variance 0.00067 0.00067 0.00067 0.00066 0.00066 0.00065
Cumulative Proportion  0.92674 0.92741 0.92808 0.92874 0.92940 0.93005
                         PC242   PC243   PC244   PC245   PC246   PC247
Standard deviation     3.33588 3.32950 3.31844 3.31419 3.30190 3.29236
Proportion of Variance 0.00065 0.00065 0.00065 0.00064 0.00064 0.00064
Cumulative Proportion  0.93070 0.93135 0.93200 0.93264 0.93328 0.93392
                         PC248   PC249   PC250   PC251   PC252   PC253
Standard deviation     3.28535 3.28141 3.26743 3.26422 3.25422 3.24564
Proportion of Variance 0.00063 0.00063 0.00063 0.00062 0.00062 0.00062
Cumulative Proportion  0.93455 0.93518 0.93581 0.93643 0.93705 0.93767
                         PC254   PC255  PC256  PC257  PC258   PC259   PC260
Standard deviation     3.23760 3.22960 3.2105 3.1997 3.1901 3.18130 3.17704
Proportion of Variance 0.00061 0.00061 0.0006 0.0006 0.0006 0.00059 0.00059
Cumulative Proportion  0.93829 0.93890 0.9395 0.9401 0.9407 0.94129 0.94189
                         PC261   PC262   PC263   PC264   PC265   PC266
Standard deviation     3.17478 3.16568 3.16051 3.15364 3.14908 3.12888
Proportion of Variance 0.00059 0.00059 0.00059 0.00058 0.00058 0.00057
Cumulative Proportion  0.94248 0.94306 0.94365 0.94423 0.94481 0.94539
                         PC267   PC268   PC269   PC270   PC271   PC272
Standard deviation     3.11436 3.11310 3.10912 3.10374 3.09464 3.08761
Proportion of Variance 0.00057 0.00057 0.00057 0.00056 0.00056 0.00056
Cumulative Proportion  0.94596 0.94653 0.94709 0.94766 0.94822 0.94878
                         PC273   PC274   PC275   PC276   PC277   PC278
Standard deviation     3.07420 3.07246 3.06774 3.04938 3.04524 3.03002
Proportion of Variance 0.00055 0.00055 0.00055 0.00055 0.00054 0.00054
Cumulative Proportion  0.94933 0.94989 0.95044 0.95098 0.95153 0.95207
                         PC279   PC280   PC281   PC282   PC283   PC284
Standard deviation     3.02888 3.02651 3.01281 3.00031 2.99465 2.98222
Proportion of Variance 0.00054 0.00054 0.00053 0.00053 0.00053 0.00052
Cumulative Proportion  0.95260 0.95314 0.95367 0.95420 0.95473 0.95525
                         PC285   PC286   PC287   PC288   PC289  PC290  PC291
Standard deviation     2.97629 2.97135 2.96061 2.95295 2.93863 2.9325 2.9254
Proportion of Variance 0.00052 0.00052 0.00051 0.00051 0.00051 0.0005 0.0005
Cumulative Proportion  0.95577 0.95629 0.95680 0.95731 0.95782 0.9583 0.9588
                        PC292   PC293   PC294   PC295   PC296   PC297
Standard deviation     2.9065 2.90285 2.89613 2.88550 2.88129 2.87175
Proportion of Variance 0.0005 0.00049 0.00049 0.00049 0.00049 0.00048
Cumulative Proportion  0.9593 0.95981 0.96030 0.96079 0.96128 0.96176
                         PC298   PC299   PC300   PC301   PC302   PC303
Standard deviation     2.86229 2.85615 2.84741 2.83469 2.82846 2.82636
Proportion of Variance 0.00048 0.00048 0.00048 0.00047 0.00047 0.00047
Cumulative Proportion  0.96224 0.96272 0.96320 0.96367 0.96414 0.96461
                         PC304   PC305   PC306   PC307   PC308   PC309
Standard deviation     2.81953 2.81194 2.80863 2.79527 2.79464 2.77689
Proportion of Variance 0.00047 0.00046 0.00046 0.00046 0.00046 0.00045
Cumulative Proportion  0.96507 0.96554 0.96600 0.96646 0.96692 0.96737
                         PC310   PC311   PC312   PC313   PC314   PC315
Standard deviation     2.77457 2.76934 2.75810 2.75248 2.74274 2.73410
Proportion of Variance 0.00045 0.00045 0.00045 0.00044 0.00044 0.00044
Cumulative Proportion  0.96782 0.96827 0.96871 0.96916 0.96960 0.97004
                         PC316   PC317   PC318   PC319   PC320   PC321
Standard deviation     2.72543 2.72232 2.71713 2.70336 2.69912 2.68869
Proportion of Variance 0.00044 0.00043 0.00043 0.00043 0.00043 0.00042
Cumulative Proportion  0.97047 0.97091 0.97134 0.97177 0.97220 0.97262
                         PC322   PC323   PC324   PC325   PC326   PC327
Standard deviation     2.68677 2.67612 2.67131 2.66771 2.65190 2.63964
Proportion of Variance 0.00042 0.00042 0.00042 0.00042 0.00041 0.00041
Cumulative Proportion  0.97304 0.97346 0.97388 0.97430 0.97471 0.97512
                         PC328   PC329  PC330  PC331  PC332  PC333   PC334
Standard deviation     2.62973 2.62841 2.6241 2.6181 2.6052 2.5954 2.59464
Proportion of Variance 0.00041 0.00041 0.0004 0.0004 0.0004 0.0004 0.00039
Cumulative Proportion  0.97553 0.97593 0.9763 0.9767 0.9771 0.9775 0.97793
                         PC335   PC336   PC337   PC338   PC339   PC340
Standard deviation     2.58933 2.57766 2.57635 2.56835 2.56263 2.54871
Proportion of Variance 0.00039 0.00039 0.00039 0.00039 0.00039 0.00038
Cumulative Proportion  0.97832 0.97871 0.97910 0.97948 0.97987 0.98025
                         PC341   PC342   PC343   PC344   PC345   PC346
Standard deviation     2.54729 2.53286 2.52395 2.52075 2.51463 2.50988
Proportion of Variance 0.00038 0.00038 0.00037 0.00037 0.00037 0.00037
Cumulative Proportion  0.98063 0.98101 0.98138 0.98175 0.98212 0.98249
                         PC347   PC348   PC349   PC350   PC351   PC352
Standard deviation     2.48787 2.47835 2.47456 2.46291 2.45721 2.45083
Proportion of Variance 0.00036 0.00036 0.00036 0.00036 0.00035 0.00035
Cumulative Proportion  0.98286 0.98322 0.98358 0.98393 0.98429 0.98464
                         PC353   PC354   PC355   PC356   PC357   PC358
Standard deviation     2.44010 2.43357 2.42909 2.42339 2.41580 2.41093
Proportion of Variance 0.00035 0.00035 0.00035 0.00034 0.00034 0.00034
Cumulative Proportion  0.98499 0.98533 0.98568 0.98602 0.98637 0.98671
                         PC359   PC360   PC361   PC362   PC363   PC364
Standard deviation     2.39708 2.39219 2.38403 2.37501 2.36474 2.36072
Proportion of Variance 0.00034 0.00034 0.00033 0.00033 0.00033 0.00033
Cumulative Proportion  0.98704 0.98738 0.98771 0.98804 0.98837 0.98870
                         PC365   PC366   PC367   PC368   PC369   PC370
Standard deviation     2.34682 2.34017 2.33101 2.32409 2.32219 2.31027
Proportion of Variance 0.00032 0.00032 0.00032 0.00032 0.00032 0.00031
Cumulative Proportion  0.98902 0.98934 0.98966 0.98998 0.99030 0.99061
                         PC371   PC372   PC373  PC374  PC375  PC376  PC377
Standard deviation     2.30664 2.29914 2.28848 2.2800 2.2738 2.2611 2.2561
Proportion of Variance 0.00031 0.00031 0.00031 0.0003 0.0003 0.0003 0.0003
Cumulative Proportion  0.99092 0.99123 0.99154 0.9918 0.9921 0.9925 0.9927
                        PC378   PC379   PC380   PC381   PC382   PC383
Standard deviation     2.2495 2.23862 2.23139 2.22381 2.20932 2.19898
Proportion of Variance 0.0003 0.00029 0.00029 0.00029 0.00029 0.00028
Cumulative Proportion  0.9930 0.99333 0.99363 0.99392 0.99420 0.99449
                         PC384   PC385   PC386   PC387   PC388   PC389
Standard deviation     2.19738 2.19201 2.17961 2.16815 2.16281 2.15704
Proportion of Variance 0.00028 0.00028 0.00028 0.00028 0.00027 0.00027
Cumulative Proportion  0.99477 0.99505 0.99533 0.99561 0.99588 0.99615
                         PC390   PC391   PC392   PC393   PC394   PC395
Standard deviation     2.13686 2.13183 2.11173 2.09438 2.08236 2.07396
Proportion of Variance 0.00027 0.00027 0.00026 0.00026 0.00025 0.00025
Cumulative Proportion  0.99642 0.99669 0.99695 0.99721 0.99746 0.99771
                         PC396   PC397   PC398   PC399   PC400   PC401
Standard deviation     2.03804 2.03039 2.01921 2.01246 2.01009 1.98283
Proportion of Variance 0.00024 0.00024 0.00024 0.00024 0.00024 0.00023
Cumulative Proportion  0.99796 0.99820 0.99844 0.99867 0.99891 0.99914
                         PC402   PC403   PC404   PC405     PC406
Standard deviation     1.95088 1.91258 1.90654 1.88289 3.267e-14
Proportion of Variance 0.00022 0.00021 0.00021 0.00021 0.000e+00
Cumulative Proportion  0.99936 0.99958 0.99979 1.00000 1.000e+00
dtp <- data.frame('Stages' = coadse.tumor$ajcc_pathologic_tumor_stage, pca$x[,1:2]) # the first two componets are selected (NB: you can also select 3 for 3D plottings or 3+)
ggplot(data = dtp) + geom_point(aes(x = PC1, y = PC2, col = Stages)) + theme_minimal() 
PC 1 and PC 2 by cancer Stage

Figure 33: PC 1 and PC 2 by cancer Stage

In Figure 33 it seems that there is no differential expression among the stages. However, we decide to proceed with the GSE Analysis. In fact, there may be small but consistent changes in gene expression values which could still imply relevant changes between the different stages.

5.2.1 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GSVA_1.30.0                 GSVAdata_1.18.0            
 [3] hgu95a.db_3.2.3             GSEABase_1.44.0            
 [5] org.Hs.eg.db_3.7.0          xtable_1.8-4               
 [7] GOstats_2.48.0              graph_1.60.0               
 [9] Category_2.48.1             Matrix_1.2-17              
[11] ggplot2_3.1.1               calibrate_1.7.2            
[13] MASS_7.3-51.4               geneplotter_1.60.0         
[15] annotate_1.60.1             XML_3.98-1.19              
[17] AnnotationDbi_1.44.0        lattice_0.20-38            
[19] edgeR_3.24.3                limma_3.38.3               
[21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
[23] BiocParallel_1.16.6         matrixStats_0.54.0         
[25] Biobase_2.42.0              GenomicRanges_1.34.0       
[27] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[29] S4Vectors_0.20.1            BiocGenerics_0.28.0        
[31] knitr_1.22                  BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            RColorBrewer_1.1-2    
 [4] Rgraphviz_2.26.0       tools_3.5.3            R6_2.4.0              
 [7] DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1      
[10] withr_2.1.2            tidyselect_0.2.5       bit_1.1-14            
[13] compiler_3.5.3         labeling_0.3           bookdown_0.9          
[16] scales_1.0.0           genefilter_1.64.0      RBGL_1.58.2           
[19] stringr_1.4.0          digest_0.6.18          rmarkdown_1.12        
[22] AnnotationForge_1.24.0 XVector_0.22.0         pkgconfig_2.0.2       
[25] htmltools_0.3.6        highr_0.8              rlang_0.3.4           
[28] RSQLite_2.1.1          shiny_1.3.2            dplyr_0.8.1           
[31] RCurl_1.95-4.12        magrittr_1.5           GO.db_3.7.0           
[34] GenomeInfoDbData_1.2.0 Rcpp_1.0.1             munsell_0.5.0         
[37] stringi_1.4.3          yaml_2.2.0             zlibbioc_1.28.0       
[40] plyr_1.8.4             grid_3.5.3             blob_1.1.1            
[43] promises_1.0.1         crayon_1.3.4           splines_3.5.3         
[46] locfit_1.5-9.1         pillar_1.3.1           codetools_0.2-16      
[49] glue_1.3.1             evaluate_0.13          BiocManager_1.30.4    
[52] httpuv_1.5.1           gtable_0.3.0           purrr_0.3.2           
[55] assertthat_0.2.1       xfun_0.6               mime_0.6              
[58] later_0.8.0            survival_2.44-1.1      tibble_2.1.1          
[61] shinythemes_1.1.2      memoise_1.1.0         

5.3 GSEA

Here we ran the GSEA analysis for the tumor dataset. Same as for the DE Analysis, we decided not to explain again every step of the pipeline, which in this case were all explained in detail in the FE Analysis section of the paired dataset.

Same as for the paired dataset, we decided to restrict our analysis to the KEGG, REACTOME and BIOCARTA genesets. We then built the incidence matrix and we only selected the genesets which are present in both our dataset and the incidence matrix. Furthermore, we filtered out the genesets that contain less than 5 genes.

After that, we calculated the z-scores for the genesets as well as the adjusted p-values and focus on the ones with lower p-values.

# collect gene sets
data(c2BroadSets)
gsc <- GeneSetCollection(c2BroadSets)
gsc <- gsc[c(grep("^KEGG", names(gsc)),grep("^REACTOME", names(gsc)), grep("^BIOCARTA", names(gsc)))]
gsc <- mapIdentifiers(gsc, AnnoOrEntrezIdentifier(metadata(coadse.tumor)$annotation))
# Know which gene is in which geneset
Im <- incidence(gsc)
# Discard genes that are not in our data
Im <- Im[, colnames(Im) %in% rownames(coadse.tumor)]
# Discard genes that are not in the dataset
coadse.tumor <- coadse.tumor[colnames(Im), ]
dge.tumor <- dge.tumor[colnames(Im),]
# Only select genesets of decent size
Im <- Im[rowSums(Im) >= 5, ]

# Z SCORES
# calc moderated test statistics
tGSgenes <- tt[match(colnames(Im), rownames(tt)), "t"]
# calc the z test statistic
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im))
#Look at the first few geneset
rnkGS <- sort(abs(zS), decreasing = TRUE)
pv <- pmin(pnorm(zS), 1 - pnorm(zS))
sum(pv < 0.5)
[1] 812
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.1)]
DEgs
 [1] "KEGG_OXIDATIVE_PHOSPHORYLATION"                                      
 [2] "REACTOME_AXON_GUIDANCE"                                              
 [3] "REACTOME_DIABETES_PATHWAYS"                                          
 [4] "REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING"                  
 [5] "REACTOME_FORMATION_OF_PLATELET_PLUG"                                 
 [6] "REACTOME_FURTHER_PLATELET_RELEASATE"                                 
 [7] "REACTOME_G_ALPHA_S_SIGNALLING_EVENTS"                                
 [8] "REACTOME_G_PROTEIN_BETA_GAMMA_SIGNALLING"                            
 [9] "REACTOME_HEMOSTASIS"                                                 
[10] "REACTOME_NCAM_SIGNALING_FOR_NEURITE_OUT_GROWTH"                      
[11] "REACTOME_NCAM1_INTERACTIONS"                                         
[12] "REACTOME_PLATELET_ACTIVATION"                                        
[13] "REACTOME_PLATELET_DEGRANULATION"                                     
[14] "REACTOME_REGULATION_OF_INSULIN_SECRETION"                            
[15] "REACTOME_SIGNALING_BY_PDGF"                                          
[16] "REACTOME_ADP_SIGNALLING_THROUGH_P2Y_PURINOCEPTOR_12"                 
[17] "REACTOME_ACTIVATION_OF_KAINATE_RECEPTORS_UPON_GLUTAMATE_BINDING"     
[18] "REACTOME_G_BETA_GAMMA_SIGNALLING_THROUGH_PLC_BETA"                   
[19] "REACTOME_G_BETA_GAMMA_SIGNALLING_THROUGH_PI3KGAMMA"                  
[20] "REACTOME_INHIBITION_OF_INSULIN_SECRETION_BY_ADRENALINE_NORADRENALINE"
[21] "REACTOME_SMOOTH_MUSCLE_CONTRACTION"                                  
[22] "BIOCARTA_SALMONELLA_PATHWAY"                                         

In the tumor-stages analysis there was not enough statistical power to detect DE genes in the different tumorogenic stages (I-IV), previosly to the gene set enrichment. However, after applying GSEA to the data we could see some relevant pathways that may differe between those stages.

Although many genes have been associated with the increased risk of CRC, the genetic differences across different stages of CRC have not been clearly identified Lorenc et al. (2017). Some genes have been reported to be potentially associated with higher stages of the cancer, as NEK4, RNF34 (implied in senescence and apoptosis) and NUDT6 (control signaling compounds and degradates potentially mutagenic oxidized nucleotides), which are expected to be in low concentration for higher stage of the disease.

For example, among the top 20 patways ranked according to the Z-score test, there are included those related to Homeostasis, the signaling pathway of PDGF, and the PI3K pathway. PDGF has been found to an important growth factor for normal tissue growth and division Manzat Saplacan et al. (2017), and also corelated with CRC invasion and metastasis when deregulated. Regarding to PI3K pathway, has also been related with loss of Adenomatous Polyposis Coli (APC), commented to be potentially important in CRC development.

However, we still miss some information that may be the reason for a non statistical reliable results. In our tumor stages study, some limitations are found: we only use TCGA CRC data on samples from cancer patients, and thus the analysis was only performed on these samples (difficulty to have a control); many field of the TCGA data were missing (NAs), which was not possible to be evaluated, and may be alterating gene expression in some of the genes of interest.

Finally, the last test we decided to perform to try to identy visually some kind of differences between the different stages was a boxplot of the different logCPM values between stages in those genes with highest logFC values.

boxplotgenes <- function(se, gene) {
  iterations = dim(se)[2]
  variables = 2
  output <- matrix(ncol=variables, nrow=iterations)
  output <- data.frame(output)
  colnames(output) <- c("stage", "logCPM")
  aa <- se[rowData(se)$symbol == gene]$ajcc_pathologic_tumor_stage
  bb<-assays(se[rowData(se)$symbol == gene])$logCPM
  for(i in 1:iterations){
  output$stage[i] <- aa[i]
  output$logCPM[i] <- bb[i]
  }
  output$stage<-gsub(x = output$stage, pattern = "1", replacement = "Stage I")
  output$stage<-gsub(x = output$stage, pattern = "2", replacement = "Stage II")
  output$stage<-gsub(x = output$stage, pattern = "3", replacement = "Stage III")
  output$stage<-gsub(x = output$stage, pattern = "4", replacement = "Stage IV")
  boxplot(logCPM ~ stage, data=output, col=c("grey","red", "pink", "cyan", main= gene , ylab="logCPM"))
}

all_DEgenes <- tt
all_DEgenes_sorted <- all_DEgenes[order(all_DEgenes$adj.P.Val),]
genes<-rownames(all_DEgenes_sorted)[13]
gene<-dge.tumor$genes$symbol[rownames(dge.tumor$genes) == genes]


boxplotgenes(coadse.tumor, gene)
Boxplot of RIPK3

Figure 20: Boxplot of RIPK3

Finally, in the figure 20 we have represented the different logCPM values among stages of RIPK3 gene, the thirteenth gene with a higher logFC value, and a gene that has been suggested as a potential predictive and prognostic marker in metastatic colon cancer Conev et al. (2019). Looking at the plot, we can not see important differences among stages but a subtle tendency of increasing the logCPM values stage to stage.

5.3.1 Session Information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GOstats_2.48.0              ggplot2_3.1.1              
 [3] calibrate_1.7.2             MASS_7.3-51.4              
 [5] Category_2.48.1             Matrix_1.2-17              
 [7] GSVA_1.30.0                 GSVAdata_1.18.0            
 [9] hgu95a.db_3.2.3             GSEABase_1.44.0            
[11] graph_1.60.0                org.Hs.eg.db_3.7.0         
[13] xtable_1.8-4                geneplotter_1.60.0         
[15] annotate_1.60.1             XML_3.98-1.19              
[17] AnnotationDbi_1.44.0        lattice_0.20-38            
[19] edgeR_3.24.3                limma_3.38.3               
[21] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
[23] BiocParallel_1.16.6         matrixStats_0.54.0         
[25] Biobase_2.42.0              GenomicRanges_1.34.0       
[27] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[29] S4Vectors_0.20.1            BiocGenerics_0.28.0        
[31] knitr_1.22                  BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            RColorBrewer_1.1-2    
 [4] Rgraphviz_2.26.0       tools_3.5.3            R6_2.4.0              
 [7] DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1      
[10] withr_2.1.2            tidyselect_0.2.5       bit_1.1-14            
[13] compiler_3.5.3         bookdown_0.9           scales_1.0.0          
[16] genefilter_1.64.0      RBGL_1.58.2            stringr_1.4.0         
[19] digest_0.6.18          rmarkdown_1.12         AnnotationForge_1.24.0
[22] XVector_0.22.0         pkgconfig_2.0.2        htmltools_0.3.6       
[25] highr_0.8              rlang_0.3.4            RSQLite_2.1.1         
[28] shiny_1.3.2            dplyr_0.8.1            RCurl_1.95-4.12       
[31] magrittr_1.5           GO.db_3.7.0            GenomeInfoDbData_1.2.0
[34] Rcpp_1.0.1             munsell_0.5.0          stringi_1.4.3         
[37] yaml_2.2.0             zlibbioc_1.28.0        plyr_1.8.4            
[40] grid_3.5.3             blob_1.1.1             promises_1.0.1        
[43] crayon_1.3.4           splines_3.5.3          locfit_1.5-9.1        
[46] pillar_1.3.1           codetools_0.2-16       glue_1.3.1            
[49] evaluate_0.13          BiocManager_1.30.4     httpuv_1.5.1          
[52] gtable_0.3.0           purrr_0.3.2            assertthat_0.2.1      
[55] xfun_0.6               mime_0.6               later_0.8.0           
[58] survival_2.44-1.1      tibble_2.1.1           shinythemes_1.1.2     
[61] memoise_1.1.0         

References

Arese, Marco, Federico Bussolino, Margherita Pergolizzi, Laura Bizzozero, and Davide Pascal. 2018. “Tumor Progression: The Neuronal Input.” Annals of Translational Medicine 6 (5).

Conev, Nikolay V, Eleonora G Dimitrova, Margarita K Bogdanova, Yavor K Kashlov, Borislav G Chaushev, Dilyan P Petrov, Chavdar H Bachvarov, et al. 2019. “RIPK3 Expression as a Potential Predictive and Prognostic Marker in Metastatic Colon Cancer.” Clinical and Investigative Medicine (Online) 42 (1): E31–E38.

Di Giorgio, Eros, Wayne W Hancock, and Claudio Brancolini. 2018. “MEF2 and the Tumorigenic Process, Hic Sunt Leones.” Biochimica et Biophysica Acta (BBA)-Reviews on Cancer 1870 (2): 261–73.

Fearon, Eric R. 2011. “Molecular Genetics of Colorectal Cancer.” Annual Review of Pathology: Mechanisms of Disease 6: 479–507.

Hsu, Stephen, Fei Huang, and Eileen Friedman. 1995. “Platelet-Derived Growth Factor-B Increases Colon Cancer Cell Growth in Vivo by a Paracrine Effect.” Journal of Cellular Physiology 165 (2): 239–45.

Lorenc, Zbigniew, Dariusz Waniczek, Katarzyna Lorenc-Podgórska, Wiktor Krawczyk, Maciej Domagała, Mateusz Majewski, and Urszula Mazurek. 2017. “Profile of Expression of Genes Encoding Matrix Metallopeptidase 9 (Mmp9), Matrix Metallopeptidase 28 (Mmp28) and Timp Metallopeptidase Inhibitor 1 (Timp1) in Colorectal Cancer: Assessment of the Role in Diagnosis and Prognostication.” Medical Science Monitor: International Medical Journal of Experimental and Clinical Research 23: 1305.

Manzat Saplacan, Roberta M, Loredana Balacescu, Claudia Gherman, Romeo I Chira, Anca Craiu, Petru A Mircea, Cosmin Lisencu, and Ovidiu Balacescu. 2017. “The Role of Pdgfs and Pdgfrs in Colorectal Cancer.” Mediators of Inflammation 2017.

Qiu, Xia, Jianguo Jiao, Yidong Li, and Tian Tian. 2016. “Overexpression of Fzd7 Promotes Glioma Cell Proliferation by Upregulating Taz.” Oncotarget 7 (52): 85987.

Radin, Daniel P, and Parth Patel. 2017. “A Current Perspective on the Oncopreventive and Oncolytic Properties of Selective Serotonin Reuptake Inhibitors.” Biomedicine & Pharmacotherapy 87: 636–39.

Rahman, Mumtahena, Laurie K Jackson, W Evan Johnson, Dean Y Li, Andrea H Bild, and Stephen R Piccolo. 2015. “Alternative Preprocessing of Rna-Sequencing Data in the Cancer Genome Atlas Leads to Improved Analysis Results.” Bioinformatics 31 (22): 3666–72.

Rebecca L. Siegel MPH, PhD, Kimberly D. Miller MPH Ahmedin Jemal DVM. 2015. “Cancer Statistics, 2015.” CA: A Cancer Journal for Clinicians 65 (1): 5–29.

Tomonaga, Takeshi, Kazuyuki Matsushita, Seiko Yamaguchi, Tatsuya Oohashi, Hideaki Shimada, Takenori Ochiai, Kinya Yoda, and Fumio Nomura. 2003. “Overexpression and Mistargeting of Centromere Protein-a in Human Primary Colorectal Cancer.” Cancer Research 63 (13): 3511–6.